IN THE UNITED STATES PATENT AND TRADEMARK OFFICE 
BEFORE THE BOARD OF PATENT APPEALS AND INTERFERENCES 



First Named Inventor: Anthony Aue et al.
Appln. No.: 10/813,208
Filed: March 30, 2004
For: STATISTICAL LANGUAGE MODEL FOR LOGICAL FORMS
Docket No.: M61.12-0630
Confirmation No.: 5138
Group Art Unit: 2626
Examiner: J. Stoffregen





BRIEF FOR APPELLANTS 



FILED ELECTRONICALLY AUGUST 21, 2008

Sir: 

This is an Appeal from an Office Action dated January 28, 2008 in which claims 1, 3-16 and
18-40 were finally rejected.



Contents

Real Party In Interest
No Related Appeals Or Interferences
Status Of The Claims
Status Of Amendments
Summary Of Claimed Subject Matter
    I. Brief Background
    II. The Claimed Subject Matter
        A. Independent Claim 1 and Separately Argued Dependent Claims 3, 4, 6, 9, 15, 18, 39 and 40
        B. Independent Claim 23 and Separately Argued Dependent Claim 24
        C. Independent Claim 30
Grounds Of Rejection To Be Reviewed On Appeal
Argument
    I. Rejection of Claims 1, 7, 8, 16, 18-23, 39 and 40 under 35 USC § 102(b)
        A. Claims 1, 7, 8, 16
        B. Claims 18-22
        C. Claim 39
        D. Claim 40
        E. Claim 23
    II. Rejection of Claims 3-6, 9-15 and 24-38 under 35 USC § 103(a)
        A. Claim 3
        B. Claims 4-5
        C. Claim 6
        D. Claims 9-14
        E. Claim 15
        F. Claims 24-29
        G. Claims 30-38
Conclusion
Appendix A: Claims On Appeal
Appendix B: Cited References
Appendix C: Evidence Appendix
Appendix D: Related Proceedings Appendix






REAL PARTY IN INTEREST 

Microsoft Corporation, a corporation organized under the laws of the state of Washington,
and having offices at One Microsoft Way, Redmond, Washington 98052, has acquired the entire
right, title and interest in and to the invention, the application, and any and all patents to be obtained 
therefor, as set forth in the Assignment filed with the patent application and recorded on Reel 
015163, frame 0222. 

NO RELATED APPEALS OR INTERFERENCES 

There are no known related appeals or interferences that will directly affect or be directly 
affected by or have a bearing on the Board's decision in this appeal. 

STATUS OF THE CLAIMS

I. Total number of claims in the application.

Claims in the application are: 1, 3-16 and 18-40

II. Status of all the claims.

A. Claims canceled: 2 and 17

B. Claims withdrawn but not cancelled:

C. Claims pending: 1, 3-16 and 18-40

D. Claims allowed: none

E. Claims rejected: 1, 3-16 and 18-40

F. Claims objected to: none

III. Claims on appeal

The claims on appeal are: 1, 3-16 and 18-40

STATUS OF AMENDMENTS

No amendments have been filed after the final rejection.



SUMMARY OF CLAIMED SUBJECT MATTER 
I. Brief Background 

Machine translation systems receive input in a source language, translate it into a
target language and provide an output in the target language. One example of a machine translation
system uses logical forms or dependency graphs to translate. In this type of machine translation, a
string in a source language is parsed to produce a source language logical form that is to be
converted into a target language logical form. The conversion uses a database of mappings, each of
which maps a source language logical form piece to a target language logical form piece and carries
other metadata, such as the size and frequency of the mapping. Typically, the source
language logical form piece of a mapping does not cover the entire source language logical form that
was built from the source string. Therefore, a set of mappings must be selected and their target
language logical form pieces combined to form a target logical form. This type of machine
translation system first sorts the mappings by size, frequency, etc. to determine how well the source
logical form pieces match the source logical form. Then, the sorted list is traversed in a top-down
manner and the first set of compatible mappings is chosen. This system, however, fails to test
all possible combinations of mappings and instead just selects the first set of mappings that
completely covers the source logical form.
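
For illustration only, the sort-then-take-first selection described in the paragraph above might be sketched as follows. This is a minimal sketch and not code from the application or from any cited reference: the Mapping class, its fields and the greedy_cover function are hypothetical names, and treating any overlap between mappings as a conflict is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    """Hypothetical stand-in for one transfer mapping in the database."""
    name: str
    covers: frozenset  # source logical form nodes this mapping matches
    frequency: int     # how often the mapping was observed


def greedy_cover(source_nodes, mappings):
    """Sort by size, then frequency, and keep the first compatible set.

    Assumption: two mappings conflict whenever they overlap. No
    alternative combination is ever scored or compared.
    """
    ranked = sorted(mappings, key=lambda m: (-len(m.covers), -m.frequency))
    chosen, covered = [], set()
    for m in ranked:
        if covered >= set(source_nodes):
            break  # the source logical form is completely covered
        if m.covers & covered:
            continue  # simplified conflict test: skip overlapping mappings
        chosen.append(m)
        covered |= m.covers
    return chosen


candidates = [
    Mapping("M1", frozenset({"a", "b"}), 1),
    Mapping("M2", frozenset({"b", "c"}), 5),
    Mapping("M3", frozenset({"c"}), 2),
    Mapping("M4", frozenset({"a"}), 9),
]
picked = greedy_cover({"a", "b", "c"}, candidates)
```

In this toy input the traversal stops at the first covering set it reaches and never weighs it against the alternative cover formed by M1 and M3, which is the limitation the paragraph above describes.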

Other types of machine translation systems do not employ logical forms and instead predict
the most likely target language string given an input string in the source language using statistical
models. These string-based statistical machine translation systems require a large search space,
which means the system takes too long to consider all possible strings. To address this
problem, string-based machine translation systems often make simplifying assumptions. Still other
types of statistical machine translation systems map a syntax structure in the source language to a
string in the target language. Events that depend on each other are often closer together in a
syntax tree than they are in the surface string of string-based models. However, the distance between
interdependent words can still be too large, and similar concepts are expressed by different
structures, resulting in poor translation performance.



II. The Claimed Subject Matter 

A. Independent Claim 1 and Separately Argued Dependent Claims 3, 4, 6, 9, 15, 18, 
39 and 40 

Independent claim 1 is directed to a method of decoding an input semantic structure (552, 
1400) to generate an output semantic structure (556) (see FIG. 13, page 37, line 23 through page 38, 
line 6). The method includes providing a set of transfer mappings (542, FIGS. 15-21) that cover at 
least portions of the input semantic structure (552, 1400) (see page 38, lines 3-6). Each transfer 
mapping (542, FIGS. 15-21) has an input semantic side that describes nodes of the input semantic 
structure (552, 1400) and has an output semantic side that describes nodes of the output semantic 
structure (556) (see page 37, lines 14-22). A score is calculated (1312), using a statistical model,
for each of the set of transfer mappings (542, FIGS. 15-21) that describes a select node of the input
semantic structure (552, 1400) (see page 26, line 1 through page 28, line 9). The transfer
mapping (542, FIGS. 15-21) that describes the select node and has a highest score is selected
(1316) (see page 41, line 11 through page 42, line 3). The selected transfer mapping is used to
construct the output semantic structure (556) (see page 47, lines 22-29).
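
For illustration only, the scoring and selecting steps summarized above might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation described in the specification: select_best_mapping is a hypothetical name, and the log-probability table stands in for whatever statistical model supplies the scores.

```python
def select_best_mapping(candidates, score_fn):
    """Score every candidate for one select node and keep the highest.

    candidates: transfer mappings that describe the select node.
    score_fn:   hypothetical stand-in for the statistical model.
    """
    scored = [(score_fn(m), m) for m in candidates]
    best_score, best_mapping = max(scored, key=lambda pair: pair[0])
    return best_mapping, best_score


# Toy statistical model: a table of assumed log-probabilities.
toy_log_probs = {"mapping_a": -4.0, "mapping_b": -1.5, "mapping_c": -2.5}
best, score = select_best_mapping(list(toy_log_probs), toy_log_probs.get)
```

Every candidate for the select node receives a score before any selection is made, in contrast to taking the first compatible entry in a pre-sorted list.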

Dependent claim 3 depends from independent claim 1 and further describes that the
score is calculated (1312) using a target language model that provides a
probability of a set of nodes appearing in the output semantic structure (see page 29, line 1 through
page 30, line 15 and page 44, lines 1-10).

Dependent claim 4 depends from independent claim 1 and further describes that the
score is calculated (1312) using a channel model that provides a probability of
an input semantic side of a transfer mapping given the output semantic side of the transfer mapping
(see page 30, line 16 through page 34, line 16 and page 44, lines 11-15).

Dependent claim 6 depends from independent claim 1 and further describes that the
score is calculated (1312) using a fertility model that provides a probability of
node deletion in a transfer mapping (see page 35, line 6 through page 36, line 15 and page 44, lines
16-19).

Dependent claim 9 depends from independent claim 1 and further describes that, when the
score is calculated (1312) for each transfer mapping in the set of transfer mappings (542,
FIGS. 15-21), separate scores are computed for a plurality of models and the separate scores
are combined to determine the score for each transfer mapping (542, FIGS. 15-21) that describes
a select node of the input semantic structure (552, 1400) (see page 44, line 1 through page 45,
line 6).

Dependent claim 10 depends from dependent claim 9 and further describes the plurality
of models as including a channel model that provides a probability of an input semantic side of a 
transfer mapping given the output semantic side of the transfer mapping (see page 30, line 16 
through page 34, line 16 and page 44, lines 11-15). 

Dependent claim 15 depends from dependent claim 9 and further describes multiplying each
score by a weight to form weighted model scores and summing the weighted model scores to
determine the score for each transfer mapping that describes a select node of the input semantic
structure (552, 1400) (see page 14, line 27 through page 28, line 25).
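
For illustration only, the weighted combination summarized above might be sketched as follows. The model names, scores and weights are assumed values chosen for the example, and combined_score is a hypothetical name.

```python
import math


def combined_score(model_scores, weights):
    """Multiply each model score by its weight and sum the results.

    model_scores: separate scores from each model for one transfer mapping.
    weights:      one weight per model (assumed values; how the weights
                  are obtained is outside this sketch).
    """
    return sum(weights[name] * score for name, score in model_scores.items())


scores = {"target_language": -2.0, "channel": -1.0, "fertility": -0.5}
weights = {"target_language": 1.0, "channel": 0.5, "fertility": 2.0}
total = combined_score(scores, weights)  # (-2.0) + (-0.5) + (-1.0)
```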

Dependent claim 18 depends from dependent claim 16 and further describes that when a 
score is calculated for each of the set of transfer mappings (542, FIGS. 15-21) a score for a tree of 
transfer mappings is calculated by recursively calculating a score for each level of nested subtrees. 
Calculating a score for a subtree includes recursively scoring the subtrees of the subtree, calculating 
a score for the root transfer mapping of the subtree and combining the scores for the subtrees of the 
subtree with the score for the root transfer mapping of the subtree (see page 37, line 23 through page 
45, line 10). A score for the root transfer mapping is calculated and the score for each subtree is
combined with the score for the root transfer mapping (see page 45, line 11 through page 46, line 5).
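
For illustration only, the recursive scoring of nested subtrees summarized above might be sketched as follows. MappingNode and tree_score are hypothetical names, and combining scores by addition (as with log-probabilities) is an assumption; the summary above does not fix the combining operation.

```python
from dataclasses import dataclass, field


@dataclass
class MappingNode:
    """Hypothetical node in a tree of transfer mappings."""
    root_score: float                      # score of this root transfer mapping
    children: list = field(default_factory=list)


def tree_score(node):
    """Recursively score each subtree, then combine with the root's score."""
    subtree_total = sum(tree_score(child) for child in node.children)
    return node.root_score + subtree_total


leaf_one = MappingNode(-1.0)
leaf_two = MappingNode(-2.0)
middle = MappingNode(-0.5, [leaf_one])
root = MappingNode(-0.25, [middle, leaf_two])
```

Here the middle subtree scores -1.5, so the whole tree scores -0.25 + (-1.5) + (-2.0) = -3.75.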

Dependent claim 39 depends from independent claim 1 and further describes that providing the
set of transfer mappings includes providing root transfer mappings that describe the
select node and root transfer mappings that describe any child nodes of the select node (see page 38,
lines 7-30).

Dependent claim 40 depends from dependent claim 39 and further describes that, when
the score is calculated for each of the set of transfer mappings, the calculation includes
determining (1308) whether each transfer mapping that describes the select node also describes
any child nodes (see page 38, line 21 through page 40, line 5). The score is calculated (1312)
with the statistical model for each of the root transfer mappings that describe one of the child
nodes of the select node (see page 40, lines 6-29). The root transfer mapping that describes one
of the child nodes of the select node and has the highest score is selected (1316) (see page 40,
line 17 through page 42, line 3). The scores of the highest scoring root transfer mappings that
describe each of the child nodes are combined with a score of the root transfer mapping of the
select node to find the score for each of the set of transfer mappings that describes the select
node (see page 42, line 4 through page 47, line 21).
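
For illustration only, combining a select node's root mapping score with the best-scoring root mapping for each child node might be sketched as follows. The function name, the node labels and the use of addition to combine scores are all assumptions made for the example.

```python
def score_with_children(root_score, child_candidate_scores):
    """Combine a root mapping's score with the best score per child node.

    child_candidate_scores: for each child node, the scores of the root
    transfer mappings that describe that child (assumed log-scores).
    """
    best_per_child = [max(scores) for scores in child_candidate_scores.values()]
    return root_score + sum(best_per_child)


children = {"child_1": [-3.0, -0.5], "child_2": [-2.0, -4.0]}
total = score_with_children(-1.0, children)  # -1.0 + (-0.5) + (-2.0)
```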

B. Independent Claim 23 and Separately Argued Dependent Claim 24 

Independent claim 23 is directed to a machine translation system (500) for translating an 
input in a first language into an output in a second language (see page 21, lines 15-26). The 
system (500) includes a parser (522) for parsing the input into an input semantic representation 
(552) (see page 24, lines 4-8). The system (500) includes a search component (524) configured 
to find a set of transfer mappings (542) (see page 24, lines 9-15). Each transfer mapping includes 
an input semantic side that corresponds with portions of the input semantic representation (552) 
(see page 37, lines 14-22). The system (500) includes a decoding component (554) configured to 
score each of the set of transfer mappings (542) that corresponds with a select portion of the 
input semantic representation (552) and to select which of the transfer mappings that correspond 
with the select portion of the input semantic representation (552) that has a highest score (see 
page 24, lines 16-25 and FIG. 13). The system (500) also includes a generation component (528) 
configured to generate the output (558) based on the selected transfer mapping (see page 25, 
lines 13-27). 

Dependent claim 24 depends from claim 23 and further describes that the decoding 
component (554) scores each transfer mapping by using a plurality of statistical models (see 
page 26, line 1 through page 28, line 9). 

C. Independent Claim 30 

Independent claim 30 is directed to a method of determining a score for a word string. 
An input semantic structure (552, 1400) having a plurality of nodes that relate to an input word 
string is computed (see page 24, lines 4-8). A set of transfer mappings (542, FIGS. 15-21) is
obtained (see page 24, lines 9-15). Each of the set of transfer mappings includes an input
semantic side that describes at least one node of the input semantic structure (552, 1400) (see
page 37, lines 14-22). Each of the set of transfer mappings (542, FIGS. 15-21) that describes a
select node of the input semantic structure (552, 1400) is scored with a target language model
that provides a probability of sequences of nodes appearing in an output semantic structure (556) 
having a plurality of nodes that relate to an output word string (558) (page 37, line 23 through 
page 47, line 29). 

GROUNDS OF REJECTION TO BE REVIEWED ON APPEAL 

I. Whether claims 1, 7, 8, 16, 18-23, 39 and 40 are anticipated by Menezes et al. (see 
Appendix B, Exhibit A). 

II. Whether claims 3-6, 9-15 and 24-38 are obvious over Menezes et al. in view of 
Brown et al. (See Appendix B, Exhibit B). 

ARGUMENT 

I. REJECTION OF CLAIMS 1, 7, 8, 16, 18-23, 39 and 40 UNDER 35 USC §102(b) 

Claims 1, 7, 8, 16, 18-23, 39, and 40 were rejected under 35 U.S.C. 102(b) as being 
anticipated by Menezes et al. (Appendix B, Exhibit A). "A claim is anticipated only if each and 
every element as set forth in the claim is found, either expressly or inherently described, in a 
single prior art reference." Verdegaal Bros. v. Union Oil Co. of California, 2 USPQ2d 1051,
1053 (Fed. Cir. 1987).

Appellant respectfully believes that the Examiner has failed to present a case of 
anticipation against the rejected claims. Specifically, the cited reference fails to describe all of 
the elements in claims 1, 7, 8, 16, 18-23, 39, and 40. 

Before discussing in detail the rejection under 35 U.S.C. 102(b), it is pointed out that the
background of Appellant's patent application discusses that features of the Menezes et al.
reference are known. However, it is further pointed out that the Appellant's claims differ from that
which was disclosed in the background. Like Appellant's patent application, Menezes et al. uses
source logical forms as a basis for machine translation. To translate, however, Menezes et al.
uses transfer logical form mappings that are sorted by size, frequency, etc. Then the sorted
mappings are analyzed to find the first set of compatible mappings that covers the source logical
form. This type of system does not test all combinations of mappings as is discussed in
Appellant's application, but only selects the first set of mappings that completely covers the
source logical form.

A. Claims 1, 7, 8 and 16 

With regard to independent claim 1, Menezes et al. fails to describe "calculating a score for
each of the set of transfer mappings that describe a select node of the input semantic structure 
using a statistical model." Instead, Menezes et al. describes (in paragraph 66) that "matching 
component 224 searches for the best set of matching transfer mappings in database 218 that have 
matching lemmas, parts of speech and other feature information." Menezes further describes in 
paragraph 66 and also lays out in paragraph 120 that "the set of best matches is found based on a 
predetermined metric. For example, transfer mappings having larger (more specific) logical 
forms may illustratively be preferred to transfer mappings having smaller (more general) logical 
forms. Among mappings having logical forms of equal size, matching component 224 may 
illustratively prefer higher frequency mappings. Mappings may also match overlapping portions 
of the source logical form 252 provided that they do not conflict with each other in any way. A 
set of mappings collectively may be illustratively preferred if they cover more of the input 
sentence than alternative sets." While Menezes et al. finds best matching transfer mappings for a 
source logical form, claim 1 actually calculates a score for each transfer mapping that describes a 
select node of the input semantic structure. By calculating a score, all transfer mappings that 
cover one or more nodes of the source logical form are considered. The Examiner considers the
metrics illustrated on page 9, Table 1 of Menezes et al. as being used for calculating a score.
However, these metrics are used when finding best transfer mappings for an entire source logical
form. There is no indication that these metrics are used to score transfer mappings for a select
node.

In addition, Menezes et al. fails to describe "selecting which of the transfer mappings that
describe the select node has a highest score" as claimed in claim 1. As previously discussed,
Menezes et al. determines the best transfer mappings for a source logical form. Menezes et al.
fails to determine which score of a transfer mapping calculated for a select node is the highest
score. The Examiner considers paragraph 121 of Menezes et al. as describing the selecting of a
highest score of the scored transfer mapping for a select node. However, paragraph 121 merely
states that "a subset of matching transfer mappings is selected." There is no indication that a
highest score, let alone a highest score of transfer mappings for a select node, is selected.

Still further, Menezes et al. fails to describe "using the selected transfer mapping to 
construct the output semantic structure" as claimed in claim 1. The Examiner considers 
paragraph 121 of Menezes et al. as describing this. However, paragraph 121 merely states that 
"transfer mappings in the subset are combined into a transfer logical form from which the output 
text is generated." There is no indication that the selected transfer mapping that has a highest 
score that describes a select node is used to construct an output semantic structure. 

It is respectfully submitted that claim 1 is in condition for allowance. It is respectfully 
submitted that claims 7, 8 and 16 are also in condition for allowance as depending on an 
allowable base claim. 

B. Claims 18-22

It is respectfully submitted that claim 18 is also in condition for allowance at least based
on its dependence on allowable claims 16 and 1. However, claim 18 is in condition for allowance
for additional reasons. In particular, Menezes et al. fails to describe calculating a score for each
level of nested subtrees, where the score for each subtree is calculated by calculating a score for
the root transfer mapping of a subtree and combining the score of the subtrees of the subtree with
the score of the root transfer mapping of the subtree. Menezes et al. also fails to describe that,
after a score is calculated for each level of nested subtrees, a score is calculated for the root
transfer mapping and the root transfer mapping score is then combined with the score of each subtree.

Besides failing to calculate a score as noted above, Menezes et al. fails to
calculate a score in the manner claimed in claim 18. The Examiner points to paragraphs 92 and 93
and Table 1 of Menezes et al. as teaching the features of claim 18. However, paragraphs 92 and 93
discuss the alignment of logical forms for the training and building of a transfer mapping
database. The alignment of logical forms to make the transfer mapping database is much
different than the scoring of select transfer mappings in the approach as claimed in claim 18 to
find highest scoring mappings.

It is respectfully submitted that claim 18 is in condition for allowance. In addition, claims
19-22 are in condition for allowance at least based on their dependence on allowable claim 18.

C. Claim 39 

It is respectfully submitted that claim 39 is also in condition for allowance at least based 
on its dependence on allowable claim 1. However, claim 39 is in condition for allowance for 
additional reasons. In particular, Menezes et al. fails to describe providing transfer mappings for
the select node as well as any child nodes of the select node as claimed in claim 39. The
Examiner points to FIG. 5B to show the features of claim 39. However, FIG. 5B illustrates the
alignment of a logical form diagramming a Spanish sentence to a logical form of the same
sentence in English. Menezes et al. fails to describe the use of multiple transfer mappings for
each parent node and child nodes.

It is respectfully submitted that claim 39 is in condition for allowance. 

D. Claim 40 

It is respectfully submitted that claim 40 is also in condition for allowance at least based 
on its dependence on allowable claims 1 and 39. However, claim 40 is in condition for allowance 
for additional reasons. In particular, Menezes et al. fails to describe combining scores of the 
highest scoring mappings that describe the child node with a score of the select node to find the 
scores for each set of transfer mappings that describe the select node. Again, although Menezes 
et al. discusses alignment of logical forms for the training and building of a transfer mapping 
database, the alignment of logical forms to make the transfer mapping database is much different 
than the scoring of select transfer mappings as claimed in claim 40. 

It is respectfully submitted that claim 40 is in condition for allowance.






E. Claim 23 

With regard to independent claim 23, it is respectfully submitted that Menezes et al. fails to
describe "a decoding component configured to score each of the set of transfer mappings that
corresponds with a select portion of the input semantic representation and to select which of the
transfer mappings that correspond with the select portion of the input semantic representation
has a highest score." While Menezes et al. finds best matching transfer mappings for a source
logical form, claim 23 actually uses a decoding component that scores each of the set of
transfer mappings that corresponds to a select portion of the input semantic representation. The
Examiner considers the metrics discussed in paragraph 7 of Menezes et al. as describing such a
decoding component. However, these metrics are used when finding best transfer mappings for
an entire source logical form. There is no indication that these metrics are used to score transfer
mappings for a select portion of the input semantic representation.

It is respectfully submitted that claim 23 is in condition for allowance. 

II. REJECTION OF CLAIMS 3-6, 9-15 and 24-38 UNDER 35 USC §103(a) 

Claims 3-6, 9-15, and 24-38 were rejected under 35 U.S.C. 103(a) as being unpatentable
over Menezes et al. (Appendix B, Exhibit A) in view of Brown et al. (Appendix B, Exhibit B). As
established by Graham v. John Deere Co., 148 USPQ 459 (1966), adopted in KSR
International Co. v. Teleflex Inc., 82 USPQ2d 1385 (2007) and as recited in MPEP §2141,
obviousness is a question of law based on underlying factual inquiries. The factual inquiries
enunciated by the Graham Court were as follows:

(A) Ascertain the scope and content of the prior art; 

(B) Ascertain the differences between the claimed invention and the prior art; and 

(C) Resolve the level of ordinary skill in the pertinent art. 

It is respectfully submitted that differences between the claimed invention and the combination
of cited references still exist and therefore the Examiner has failed to clearly articulate reasons to
support a legal conclusion of obviousness.

Before discussing in detail the rejection under 35 U.S.C. 103(a), it is pointed out that the
background of Appellant's patent application discusses that features of the Brown et al. reference
are known. However, it is further pointed out that the Appellant's claims differ from that
which was disclosed in the background and differ from the combination of cited
references.

A. Claim 3 

It is respectfully submitted that claim 3 is also in condition for allowance at least based
on its dependence on allowable claim 1. However, claim 3 is in condition for allowance for
additional reasons. In particular, the combination of cited references fails to describe that
calculating a score for each transfer mapping that describes a select node includes calculating a
score with a target language model. The Examiner points to col. 8, lines 48-50 of Brown et al. as
teaching a probability model. However, Brown et al. fails to describe that such a probability
model is used to calculate a score of a transfer mapping as is claimed. More specifically, the
combination of references fails to teach how a probability model in machine translation could be
used with a logical form machine translation system in the way that is claimed.

It is respectfully submitted that claim 3 is in condition for allowance. 

B. Claims 4-5 

It is respectfully submitted that claim 4 is also in condition for allowance at least based
on its dependence on allowable claim 1. However, claim 4 is in condition for allowance for
additional reasons. In particular, the combination of cited references fails to describe that
calculating a score for each transfer mapping that describes a select node includes calculating a
score with a channel model. The Examiner points to col. 8, lines 52-54 of Brown et al. as
teaching such a probability model. However, Brown et al. fails to describe that such a probability
model is used to calculate a score of a transfer mapping as is claimed. More specifically, the
combination of references fails to teach how a probability model in machine translation could be
used with a logical form machine translation system in the way that is claimed.

It is respectfully submitted that claim 4 is in condition for allowance. In addition, claim 5 
is in condition for allowance at least based on its dependence on allowable claim 4. 






C. Claim 6 

It is respectfully submitted that claim 6 is also in condition for allowance at least based
on its dependence on allowable claim 1. However, claim 6 is in condition for allowance for
additional reasons. In particular, the combination of cited references fails to describe that
calculating a score for each transfer mapping that describes a select node includes calculating a
score with a fertility model. The Examiner points to col. 52, lines 35-6 of Brown et al. as
teaching such a probability model. However, Brown et al. fails to describe that such a probability
model is used to calculate a score of a transfer mapping as is claimed. More specifically, the
combination of references fails to teach how a probability model in machine translation could be
used with a logical form machine translation system in the way that is claimed.

It is respectfully submitted that claim 6 is in condition for allowance. 

D. Claims 9-14

It is respectfully submitted that claim 9 is also in condition for allowance at least based
on its dependence on allowable claim 1. However, claim 9 is in condition for allowance for
additional reasons. In particular, the combination of cited references fails to describe that
calculating a score for each transfer mapping that describes a select node includes computing
separate scores for a plurality of models and combining the separate scores to determine the score
for each transfer mapping. While the abstract of Brown et al. recites that two models are used to
score a hypothesis and the two model scores are combined into one, the combination of cited
references fails to describe that scores are calculated for each transfer mapping of a select node
from a plurality of models.

It is respectfully submitted that claim 9 is in condition for allowance. In addition, claims 
10-14 are in condition for allowance at least based on their dependence on allowable claim 9. 

E. Claim 15 

It is respectfully submitted that claim 15 is also in condition for allowance at least based
on its dependence on allowable claims 1 and 9. However, claim 15 is in condition for allowance
for additional reasons. In particular, the combination of cited references fails to describe that
combining separate scores includes multiplying each score by a weight to form weighted model
scores and summing the weighted model scores to determine the score for each transfer mapping
that describes a select node. The Examiner points to equation 7 of Brown et al. However, all the
Examiner says is that this equation includes X without pointing to anywhere in the Brown et al.
specification that describes X as being a weight, let alone that a weight is applied to each of a
plurality of model scores or that the weight is used to calculate a score for each transfer
mapping that describes a select node.

It is respectfully submitted that claim 15 is in condition for allowance. 

F. Claims 24-29

It is respectfully submitted that claim 24 is also in condition for allowance at least based
on its dependence on allowable claim 23. However, claim 24 is in condition for allowance for
additional reasons. In particular, the combination of cited references fails to describe that the
decoding component scores each transfer mapping by using a plurality of statistical models.
Although the abstract of Brown et al. describes the use of two different models, the combination
of cited references fails to describe how one would use statistical models in a semantic
representation type of translation or that one would score each transfer mapping that
corresponds with a select portion of the representation using statistical models.

It is respectfully submitted that claim 24 is in condition for allowance. In addition, claims 
25-29 are in condition for allowance at least based on their dependence on allowable claim 24. 

G. Claims 30-38 

With regard to claim 30, the combination of cited references fails to describe "scoring each
of the set of transfer mappings that describe a select node of the input semantic structure with a
target language model that provides a probability of sequences of nodes appearing in an output
semantic structure having a plurality of nodes that relate to an output word string." While the
combination of cited references finds best matching transfer mappings for a source logical form
and describes a target language model, claim 30 actually scores each of the set of transfer
mappings that describe a select node of the input semantic structure with a target language
model. The combination of cited references fails to describe this.

It is respectfully submitted that claim 30 is in condition for allowance. It is respectfully 
submitted that claims 31-38 are also in condition for allowance as depending on an allowable 
base claim. 

CONCLUSION 

For the reasons discussed above, Appellant respectfully submits that claims 1, 3-16 and
18-40 are not described by the references cited by the Examiner, and that the Examiner has not
presented a convincing line of reasoning as to why an artisan would have found the claimed
invention to have been anticipated or obvious in light of the teachings of the references. Thus,
Appellant respectfully requests that the Board reverse the Examiner and find all pending claims
1, 3-16 and 18-40 allowable.

The Director is authorized to charge any fee deficiency required by this paper or credit 
any overpayment to Deposit Account No. 23-1123.

Respectfully submitted, 

WESTMAN, CHAMPLIN & KELLY, P.A. 

By: /Leanne Taveggia Farrell/ 

Leanne Taveggia Farrell, Reg. No. 53,675 
900 Second Avenue South, Suite 1400 
Minneapolis, Minnesota 55402-3244 
Phone: (612)334-3222 
Fax: (612)334-3312 

LTF/jmt 



Appendix A: Claims On Appeal 

1. A method of decoding an input semantic structure to generate an output semantic structure,
the method comprising:

providing a set of transfer mappings that cover at least portions of the input semantic
structure, each transfer mapping having an input semantic side that describes
nodes of the input semantic structure and having an output semantic side that
describes nodes of the output semantic structure;

calculating a score for each of the set of transfer mappings that describe a select node of
the input semantic structure using a statistical model;

selecting which of the transfer mappings that describe the select node has a highest score;
and

using the selected transfer mapping to construct the output semantic structure.
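
By way of illustration only, the calculating and selecting steps recited in claim 1 can be sketched as a maximization over the candidate transfer mappings that describe the select node. The names `TransferMapping`, `covers`, `output_nodes`, `select_best_mapping` and `score_fn` are hypothetical and do not appear in the application; `score_fn` merely stands in for whatever statistical model supplies the score.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class TransferMapping:
    """Hypothetical container for one transfer mapping."""
    name: str
    covers: frozenset      # ids of input nodes described by the input semantic side
    output_nodes: tuple    # nodes the output semantic side contributes


def select_best_mapping(
    mappings: Iterable[TransferMapping],
    select_node,
    score_fn: Callable[[TransferMapping], float],
) -> TransferMapping:
    """Return the highest-scoring mapping among those that describe select_node."""
    # Only mappings whose input side describes the select node are candidates.
    candidates = [m for m in mappings if select_node in m.covers]
    if not candidates:
        raise ValueError(f"no transfer mapping describes node {select_node!r}")
    # score_fn is an assumed stand-in for the statistical model.
    return max(candidates, key=score_fn)
```

In this sketch the output semantic structure would then be assembled from the `output_nodes` of the returned mapping; how that assembly is performed is outside the scope of the example.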

3. The method of claim 1, wherein calculating a score for each transfer mapping in the set 
of transfer mappings that describe a select node of the input semantic structure comprises 
calculating the score using a target language model that provides a probability of a set of nodes 
appearing in the output semantic structure. 

4. The method of claim 1, wherein calculating a score for each transfer mapping in the set 
of transfer mappings that describe a select node of the input semantic structure comprises 
calculating the score using a channel model that provides a probability of an input semantic side 
of a transfer mapping given the output semantic side of the transfer mapping. 

5. The method of claim 4, wherein calculating a score using the channel model comprises 
normalizing a channel model score based on a number of overlapping nodes between transfer 
mappings. 



6. The method of claim 1, wherein calculating a score for each transfer mapping in the set 
of transfer mappings that describe a select node of the input semantic structure comprises 
calculating the score using a fertility model that provides a probability of node deletion in a 
transfer mapping. 

7. The method of claim 1, wherein calculating a score for each transfer mapping in the set 
of transfer mappings that describe a select node of the input semantic structure comprises 
calculating a size score based on a number of nodes in the input semantic side of the transfer 
mapping. 

8. The method of claim 1, wherein calculating a score for each transfer mapping in the set 
of transfer mappings that describe a select node of the input semantic structure comprises 
calculating a rank score based on a number of matching binary features in the input semantic 
structure and the input semantic side of the transfer mapping. 

9. The method of claim 1, wherein calculating a score for each transfer mapping in the set 
of transfer mappings that describe a select node of the input semantic structure comprises: 

computing separate scores for a plurality of models; and 

combining the separate scores to determine the score for each transfer mapping that 
describe a select node of the input semantic structure. 

10. The method of claim 9 wherein the plurality of models comprises a channel model that 
provides a probability of an input semantic side of a transfer mapping given the output semantic 
side of the transfer mapping. 

11. The method of claim 9 wherein the plurality of models comprises a fertility model that 
provides a probability of node deletion in a transfer mapping. 



12. The method of claim 9 wherein the plurality of models comprises a target language 
model that provides a probability of a set of nodes appearing in the output semantic structure. 

13. The method of claim 9 and further comprising:

computing a size score for each transfer mapping that describes a select node of the input
semantic structure, the size score based on a number of nodes in the input
semantic side of each transfer mapping; and

combining the size score with the separate scores for the plurality of models to determine
the score for each transfer mapping that describe a select node of the input
semantic structure.

14. The method of claim 9 and further comprising: 

computing a rank score for each transfer mapping that describe a select node of the input 
semantic structure, the rank score based on a number of matching binary features 
in the input semantic structure and the input semantic side of each transfer 
mapping; and 

combining the rank score with the separate scores for the plurality of models to determine 
the score for each transfer mapping that describe a select node of the input 
semantic structure. 

15. The method of claim 9 wherein combining the separate scores comprises:
multiplying each score by a weight to form weighted model scores; and 

summing the weighted model scores to determine the score for each transfer mapping that 
describe a select node of the input semantic structure. 
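
By way of illustration only, the combination recited in claim 15 (multiplying each model score by a weight and summing the weighted scores) can be sketched as follows. The function name `combine_scores`, the model names used as dictionary keys, and the example weights are assumptions made for this sketch; the claim does not specify how weights are chosen.

```python
from typing import Dict


def combine_scores(model_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted sum of per-model scores for one transfer mapping."""
    # Every model that produced a score must have a weight (assumed requirement).
    missing = set(model_scores) - set(weights)
    if missing:
        raise KeyError(f"no weight for model(s): {sorted(missing)}")
    # Multiply each score by its weight, then sum the weighted scores.
    return sum(weights[name] * score for name, score in model_scores.items())


# Example with hypothetical log-probability scores and weights.
example = combine_scores(
    {"channel": -2.0, "fertility": -1.0, "target_lm": -3.0},
    {"channel": 0.5, "fertility": 1.0, "target_lm": 2.0},
)
```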



16. The method of claim 1, wherein providing a set of transfer mappings comprises providing a 
set of transfer mappings arranged as a tree structure and multiple levels of nested subtrees 
comprising a root transfer mapping and subtrees, each subtree comprising a root transfer mapping, 
wherein each transfer mapping in the set of transfer mappings appears as a root transfer mapping in 
at least one of the tree and subtrees. 

18. The method of claim 16 wherein calculating a score for each of the set of transfer mappings
comprises calculating a score for a tree of transfer mappings through steps comprising:

recursively calculating a score for each level of nested subtrees, wherein calculating a score 
for a subtree comprises recursively scoring the subtrees of the subtree, calculating a 
score for the root transfer mapping of the subtree, and combining the scores for the 
subtrees of the subtree with the score for the root transfer mapping of the subtree; 

calculating a score for the root transfer mapping; and 

combining the score for each subtree with the score for the root transfer mapping. 
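
By way of illustration only, the recursion recited in claim 18 can be sketched as a post-order traversal: each subtree is scored first, and its score is then combined with the score of its own root transfer mapping. The names `MappingTree`, `score_tree` and `root_score` are hypothetical, and the default combination by addition (as one might do with log-probabilities) is an assumption; the claim leaves the combining operation open.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MappingTree:
    """Hypothetical tree whose nodes are root transfer mappings."""
    root: str                                   # identifier of the root transfer mapping
    subtrees: List["MappingTree"] = field(default_factory=list)


def score_tree(
    tree: MappingTree,
    root_score: Callable[[str], float],
    combine: Callable[[float, List[float]], float] = lambda r, subs: r + sum(subs),
) -> float:
    """Recursively score every level of nested subtrees, then the root."""
    # Score each subtree first; each call recurses into its own subtrees.
    sub_scores = [score_tree(s, root_score, combine) for s in tree.subtrees]
    # Combine the subtree scores with the score for this tree's root mapping.
    return combine(root_score(tree.root), sub_scores)
```

A different `combine` callable could be supplied where averaging, rather than summing, is wanted, for example for the size and rank scores of claims 20 and 22.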

19. The method of claim 18 wherein computing a score for a root transfer mapping comprises 
computing a size score for the root transfer mapping based on a number of nodes in the input 
semantic side of the root transfer mapping. 

20. The method of claim 18, wherein combining the score of subtrees with the score for a root
transfer mapping comprises combining size scores for the subtrees with the size score for the root 
transfer mapping by averaging the size scores for the subtrees with the size score for the root transfer 
mapping. 

21. The method of claim 18 wherein computing a score for a root transfer mapping comprises
computing a rank score for the root transfer mapping based on a number of matching binary features 
in the input semantic structure and the input semantic side of the root transfer mapping. 



22. The method of claim 21, wherein combining the score of subtrees with the score for a root 
transfer mapping comprises combining rank scores for the subtrees with the rank score of the root
transfer mapping by averaging the rank scores for the subtrees with the rank score of the root 
transfer mapping. 

23. A machine translation system for translating an input in a first language into an output in
a second language, the system comprising:

a parser for parsing the input into an input semantic representation;

a search component configured to find a set of transfer mappings, wherein each transfer
mapping includes an input semantic side that corresponds with portions of the
input semantic representation;

a decoding component configured to score each of the set of transfer mappings that
corresponds with a select portion of the input semantic representation and to
select which of the transfer mappings that correspond with the select portion of
the input semantic representation has a highest score; and

a generation component configured to generate the output based on the selected transfer
mapping.

24. The machine translation system of claim 23, wherein the decoding component scores 
each transfer mapping by using a plurality of statistical models. 

25. The machine translation system of claim 24, wherein the output comprises an output 
semantic representation and wherein the plurality of statistical models comprises a target model 
that provides a probability of a sequence of nodes appearing in the output semantic 
representation. 

26. The machine translation system of claim 24, wherein the plurality of statistical models 
comprises a channel model that provides a probability of a set of semantic nodes in an input side 
of a transfer mapping given a set of semantic nodes in an output side of the transfer. 



27. The machine translation system of claim 24, wherein the plurality of statistical models 
comprises a fertility model that provides a probability of a node deletion in the transfer mapping. 

28. The machine translation system of claim 24, wherein the decoding component scores 
each transfer mapping using a size score based on a number of nodes in an input side of the 
transfer mapping. 

29. The machine translation system of claim 24, wherein the decoding component scores 
each transfer mapping using a rank score based on a number of matching binary features 
between the input and an input side of the transfer mapping. 

30. A method of determining a score for a word string, the method comprising:

computing an input semantic structure having a plurality of nodes that relate to an input
word string;

obtaining a set of transfer mappings, each of the set of transfer mappings including an
input semantic side that describes at least one node of the input semantic
structure; and

scoring each of the set of transfer mappings that describe a select node of the input
semantic structure with a target language model that provides a probability of
sequences of nodes appearing in an output semantic structure having a plurality of
nodes that relate to an output word string.
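
By way of illustration only, a model that "provides a probability of sequences of nodes" can be sketched as a bigram model over node sequences. The class name `NodeBigramModel`, the `<s>` start symbol, the add-one smoothing, and the toy training sequences are all assumptions made for this sketch; the claim does not limit the target language model to this form.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


class NodeBigramModel:
    """Hypothetical add-one-smoothed bigram model over sequences of nodes."""

    def __init__(self, sequences: Iterable[Sequence[str]]):
        self.context_counts = Counter()   # how often a node appears as a context
        self.bigram_counts = Counter()    # how often (previous, current) appears
        self.vocab = set()
        for seq in sequences:
            padded = ["<s>"] + list(seq)  # "<s>" marks the start of a sequence
            self.vocab.update(padded)
            for prev, cur in zip(padded, padded[1:]):
                self.context_counts[prev] += 1
                self.bigram_counts[(prev, cur)] += 1

    def log_prob(self, seq: Sequence[str]) -> float:
        """Log-probability of a node sequence under the smoothed bigram model."""
        padded = ["<s>"] + list(seq)
        v = len(self.vocab)
        total = 0.0
        for prev, cur in zip(padded, padded[1:]):
            numerator = self.bigram_counts[(prev, cur)] + 1   # add-one smoothing
            denominator = self.context_counts[prev] + v
            total += math.log(numerator / denominator)
        return total


# Toy training data: node sequences read off hypothetical output structures.
model = NodeBigramModel([["click", "Dobj", "button"], ["click", "Dobj", "option"]])
```

Under this sketch, the output semantic side of each candidate transfer mapping would be scored by calling `log_prob` on the node sequences it contributes.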

31. The method of claim 30, wherein providing an input semantic structure having a plurality 
of nodes comprises providing an input semantic structure having a plurality of word nodes and at 
least one relationship node that describes a semantic relationship between words. 

32. The method of claim 30, wherein providing word nodes comprises providing word nodes 
for lemmas.



33. The method of claim 30, wherein scoring the input word string with a target language 
model comprises scoring the input word string with the target language model in machine 
translation. 

34. The method of claim 30, wherein scoring the input word string with a target language 
model comprises scoring the input word string with the target language model in speech 
recognition. 

35. The method of claim 30, wherein scoring the input word string with a target language 
model comprises scoring the input word string with the target language model in optical 
character recognition. 

36. The method of claim 30, wherein scoring the input word string with a target language 
model comprises scoring the input word string with the target language model in grammar 
checking. 

37. The method of claim 30, wherein scoring the input word string with a target language 
model comprises scoring the input word string with the target language model in handwriting 
recognition. 

38. The method of claim 30, wherein scoring the input word string with a target language 
model comprises scoring the input word string with the target language model in information 
extraction. 

39. The method of claim 1, wherein providing the set of transfer mappings comprises providing 
root transfer mappings that describe the select node and root transfer mappings that describe any 
child nodes of the select node. 



40. The method of claim 39, wherein calculating the score for each of the set of transfer
mappings comprises:

determining whether each transfer mapping that describes the select node also describes
any child nodes;

calculating a score for each of the root transfer mappings that describe one of the child
nodes of the select node with the statistical model;

selecting which of the root transfer mappings that describe one of the child nodes of the
select node have highest scores;

combining scores of the highest scoring root transfer mappings that describe each of the
child nodes with a score of the root transfer mapping of the select node to find the
score for each of the set of transfer mappings that describes the select node.
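
By way of illustration only, the steps recited in claim 40 can be sketched as follows. The function name `score_select_node_mapping` and its parameters are hypothetical. Two points are assumptions of this sketch rather than statements of the claim: child nodes already described by the mapping under consideration are skipped, and scores are combined by addition.

```python
from typing import Dict, List, Set


def score_select_node_mapping(
    mapping_score: float,
    covered_children: Set[str],
    child_candidates: Dict[str, List[float]],
) -> float:
    """Combine a mapping's own score with the best score found for each child node.

    mapping_score:     score of the root transfer mapping of the select node
    covered_children:  child nodes this mapping itself already describes
    child_candidates:  for each child node, scores of the root transfer
                       mappings that describe that child
    """
    total = mapping_score
    for child, scores in child_candidates.items():
        if child in covered_children:
            continue            # assumed: no separate mapping is needed for this child
        if not scores:
            continue            # no candidate mapping describes this child
        total += max(scores)    # keep only the highest-scoring root mapping
    return total


# Example with hypothetical scores.
example_uncovered = score_select_node_mapping(
    -1.0, set(), {"button": [-2.0, -0.5], "option": [-3.0]}
)
example_covered = score_select_node_mapping(
    -1.0, {"button"}, {"button": [-2.0, -0.5], "option": [-3.0]}
)
```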



Appendix B: Cited References 

Exhibit A - Menezes et al., U.S. Publication No. 2003/0023422, filed July 5, 2001 
Exhibit B - Brown et al., U.S. Patent No. 5,477,451, filed July 25, 1991 



Appendix C: Evidence Appendix 

There is no known evidence submitted pursuant to 37 CFR §§ 1.130, 1.131 or 1.132 or other
evidence entered by the Examiner.



Appendix D: Related Proceedings Appendix 

There are no known related appeals or interferences regarding the present appeal. 



EXHIBIT A 



U.S. Publication No. 2003/0023422 
Filed July 5, 2001 
Menezes et al. 



(19) United States

(12) Patent Application Publication          (10) Pub. No.: US 2003/0023422 A1
Menezes et al.                               (43) Pub. Date: Jan. 30, 2003

(54) SCALEABLE MACHINE TRANSLATION SYSTEM

(76) Inventors: Arul A. Menezes, Bellevue, WA (US);
Stephen D. Richardson, Redmond, WA (US);
Jessie E. Pinkham, Bellevue, WA (US);
William B. Dolan, Redmond, WA (US)

Correspondence Address:
Joseph R. Kelly
WESTMAN CHAMPLIN & KELLY
Suite 1600 - International Centre
900 South Second Avenue
Minneapolis, MN 55402-3319 (US)

(21) Appl. No.: 09/899,755

(22) Filed: Jul. 5, 2001

Related U.S. Application Data

(60) Provisional application No. 60/295,338, filed on Jun. 1, 2001.

Publication Classification

(51) Int. Cl.7 G06F 17/28

(52) U.S. Cl. 704/2

(57) ABSTRACT

A computer implemented method translates a textual input in a first language to a textual
output in a second language. An input logical form is generated based on the textual input.
When a plurality of transfer mappings in a transfer mapping database match the input logical
form (or at least a portion thereof) one or more of those plurality of matching transfer
mappings is selected based on a predetermined metric. Textual output is generated based on the
selected transfer logical form.




[Drawings: front-page figure and Sheets 1 to 8 of US 2003/0023422 A1 (Patent Application
Publication, Jan. 30, 2003). The figures are not reproduced; only the legible labels are
listed below.]

Front-page figure: block labels include operating system 164, application programs, other
program modules 166, program data 167, pointing device 181 and microphone.

Sheets 1 to 3 of 8: no legible content.

Sheet 4 of 8 (FIGS. 3A, 3B and 3C): logical forms 252, 254 and 256 for "hacer clic" and
"click", with Dsub, Dobj and Mods relations and nodes including boton/button, opcion/option
and you.

Sheet 5 of 8 (FIGS. 4 and 6): flow diagram boxes "Form tentative correspondence between
logical forms", "Align nodes as a function of at least one of eliminating tentative
correspondence and/or structural considerations", "Initialize set of unaligned nodes" and
"Apply alignment rules to each unaligned node irrespective of structure to align nodes";
reference numerals 300, 302, 304, 328, 330 and 332.

Sheet 6 of 8 (FIGS. 5A and 5B): logical forms with nodes including hacer, hyperlink and
Hyperlink_Information; reference numerals 320, 322 and 323.

Sheet 7 of 8 (FIG. 7): flow diagram boxes "N = 1", "Apply rule N to all unaligned nodes" and
"N = N + 1"; reference numerals 332, 334, 338 and 342.

Sheet 8 of 8: paired Spanish and English logical form fragments with nodes including
direccion/address, hipervinculo/hyperlink, informacion/Hyperlink_Information and hacer
clic/click; reference numerals 350, 352, 356 and 358.

US 2003/0023422 A1                    1                    Jan. 30, 2003



SCALEABLE MACHINE TRANSLATION SYSTEM

CROSS-REFERENCE TO RELATED APPLICATION

[0001] This application claims the benefit of U.S. provisional patent application Serial
No. 60/295,338, filed Jun. 1, 2001.

BACKGROUND OF THE INVENTION

[0002] The present invention relates to automated language translation systems. More
specifically, the present invention relates to a scaleable machine translation system.

[0003] Machine translation systems are systems which receive a textual input in one language,
translate it to a second language, and provide a textual output in the second language.
Current commercially available machine translation systems rely on hand-coded transfer
components that are both difficult and expensive to customize for a particular domain, and are
also very difficult to scale to a desirable size. These disadvantages have limited their cost
effectiveness and overall utility.

[0004] A variety of example based machine translation systems have been created to address
these deficiencies. A number of such systems are described in H. Somers, Review Article:
Example-Based Machine Translation, Machine Translation 14:113-157. Some of these typical
example based machine translation research systems have been built with an example base built
from up to approximately [illegible] sentences. They have encountered a great deal of
difficulty in scaling to a larger example base and the performance of the system suffers from
this dramatically.

[0005] Other of the data driven systems described in Somers parse the inputs from the example
base using different parsers, based upon the particular language of the input text. The
dependency structures resulting from such parsing are thus different based upon the language
and the particular parsing strategy used. Therefore, comparing the dependency structures from
one language to the next is difficult, if not impossible.

[0006] Such prior systems have also not been easily scaleable. For example, in order to
increase the number of sentences over and above, for example, 1500 sentences or so, has been
very difficult. This is because the prior systems have difficulty handling noisy input data.
Instead, the input data has been required to be in precise form, or it has been cleaned up,
and placed in the proper form, by hand. Of course, this makes it very difficult to
dramatically increase the number of sentences because of the intensive labor required to clean
up the data.

SUMMARY OF THE INVENTION

[0007] A computer implemented method translates a textual input in a first language to a
textual output in a second language. An input logical form is generated based on the textual
input. When a plurality of transfer mappings in a transfer mapping database match the input
logical form (or at least a portion thereof) one or more of those plurality of matching
transfer mappings is selected based on a predetermined metric. These transfer mappings are
stitched together to form a transfer logical form. The textual output is generated based on
the transfer logical form.

[0008] A transfer mapping is illustratively composed of a pair of logical form fragments,
including a source and target logical form (LF), learned from the training data. At runtime
the source side of these mappings is matched against the input. Among such matched mappings, a
set is chosen. The target sides of these mappings is then stitched together to produce a
single target LF. The output string is then generated from the target LF.

[0009] The predetermined metric can take one of a variety of forms including the number of
input nodes covered by the set of mappings collectively, size of the different transfer
mappings that match the input logical form, the frequency with which the plurality of matching
transfer mappings were generated during a training phase used in training the transfer mapping
database, frequencies with which the plurality of matching transfer mappings are generated
from completely aligned logical forms during training, frequencies with which the plurality of
matching transfer mappings were generated from non-fitted parses of the training data, and a
score associated with each of the plurality of matching transfer mappings that is indicative
of a confidence in the transfer mapping with which it is associated.

[0010] The present invention can also be embodied as a machine translation system including a
matching component configured to implement the method discussed above.

[0011] The present invention can also be implemented as a machine translation system that
includes an input generator generating an input dependency structure based on the textual
input. The system also includes a transfer mapping database that holds a plurality of transfer
mapping dependency structures formed based on at least 10,000 parallel, aligned, training
sentences. The transfer mapping database can also be formed based on 50,000, 100,000, 150,000,
or even in excess of [illegible] training sentences.

[0012] In addition, the present invention can be embodied as a computer implemented method of
training a transfer mapping database which includes generating shared input logical forms for
bilingual input sentences, the input logical forms being shared across both languages.

[0013] In yet another embodiment, the present invention trains the transfer mapping database
by filtering transfer mappings obtained from aligned logical forms, aligned during training.

BRIEF DESCRIPTION OF THE DRAWINGS

[0014] FIG. 1 is a block diagram of an illustrative environment in which the present
invention may be used.

[0015] FIG. 2 is a block diagram of a machine translation architecture in accordance with one
embodiment of the present invention.

[0016] FIG. 3A is an example of a logical form produced for a textual input in a source
language (in this example, Spanish).

[0017] FIG. 3B is a linked logical form for the textual input in the source language.

[0018] FIG. 3C is a target logical form representing a translation of the source language
input in a target language output (in this example, English).

[0019] FIG. 4 is a flow diagram illustrating a method for aligning nodes.

[0020] FIG. 5A is an example of tentative correspondences formed between logical forms.

[0021] FIG. 5B is an example of aligned nodes formed between the logical forms of FIG. 5A.

[0022] FIG. 6 is a flow diagram illustrating application of a set of rules to the method of
FIG. 4.

[0023] FIG. 7 is a flow diagram illustrating application of an ordered set of rules.

[0024] FIG. 8 is a set of transfer mappings associated with the example of FIG. 5B.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

General Overview

[0025] The following is a brief description of a general purpose computer 120 illustrated in
FIG. 1. However, the computer 120 is only one example of a suitable computing environment and
is not intended to suggest any limitation as to the scope of use or functionality of the
invention. Neither should the computer 120 be interpreted as having any dependency or
requirement relating to any one or combination of modules illustrated therein.

[0026] The invention may be described in the general context of computer-executable
instructions, such as program modules, being executed by a computer. Generally, program
modules include routines, programs, objects, modules, data structures, etc. that perform
particular tasks or implement particular abstract data types. The invention may also be
practiced in distributed computing environments where tasks are performed by remote processing
devices that are linked through a communications network. In a distributed computing
environment, program modules may be located in both local and remote computer storage media
including memory storage devices. Tasks performed by the programs and modules are described
below and with the aid of figures. Those skilled in the art can implement the description and
figures as processor executable instructions, which can be written on any form of a computer
readable media.

[0027] With reference to FIG. 1, modules of computer 120 may include, but are not limited to,
a processing unit 140, a system memory 150, and a system bus 141 that couples various system
modules or components including the system memory to the processing unit 140. The system bus
141 may be any of several types of bus structures including a memory bus or memory controller,
a peripheral bus, and a local bus using any of a variety of bus architectures. By way of
example, and not limitation, such architectures include Industry Standard Architecture (ISA)
bus, Universal Serial Bus (USB), Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA)
bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component
Interconnect (PCI) bus also known as Mezzanine bus. Computer 120 typically includes a variety
of computer readable mediums. Computer readable mediums can be any available media that can be
accessed by computer 120 and includes both volatile and nonvolatile media, removable and
non-removable media. By way of example, and not limitation, computer readable mediums may
comprise computer storage media and communication media. Computer storage media includes both
volatile and nonvolatile, removable and non-removable media implemented in any method or
technology for storage of information such as computer readable instructions, data structures,
program modules or other data. Computer storage media includes, but is not limited to, RAM,
ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or
other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other
magnetic storage devices, or any other medium which can be used to store the desired
information and which can be accessed by computer 120.

[0028] Communication media typically embodies computer readable instructions, data
structures, program modules or other data in a modulated data signal such as a carrier wave or
other transport mechanism and includes any information delivery media. The term "modulated
data signal" means a signal that has one or more of its characteristics set or changed in such
a manner as to encode information in the signal. By way of example, and not limitation,
communication media includes wired media such as a wired network or direct-wired connection,
and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of
any of the above should also be included within the scope of computer readable media.

[0029] The system memory 150 includes computer storage media in the form of volatile and/or
nonvolatile memory such as read only memory (ROM) 151 and random access memory (RAM) 152. A
basic input/output system 153 (BIOS), containing the basic routines that help to transfer
information between elements within computer 120, such as during start-up, is typically stored
in ROM 151. RAM 152 typically contains data and/or program modules that are immediately
accessible to and/or presently being operated on by processing unit 140. By way of example,
and not limitation, FIG. 1 illustrates operating system 154, application programs 155, other
program modules 156, and program data 157.

[0030] The computer 120 may also include other removable/non-removable volatile/nonvolatile
computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 161 that
reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 171
that reads from or writes to a removable, nonvolatile magnetic disk 172, and an optical disk
drive 175 that reads from or writes to a removable, nonvolatile optical disk 176 such as a CD
ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer
storage media that can be used in the exemplary operating environment include, but are not
limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital
video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 161 is
typically connected to the system bus 141 through a non-removable memory interface such as
interface 160, and magnetic disk drive 171 and optical disk drive 175 are typically connected
to the system bus 141 by a removable memory interface, such as interface 170.

[0031] The drives and their associated computer storage media discussed above and illustrated
in FIG. 1, provide storage of computer readable instructions, data structures, program modules
and other data for the computer 120. In FIG. 1, for example, hard disk drive 161 is
illustrated as storing operating system 164, application programs 165, other program modules
166, and program data 167. Note that these modules can either be the same as or different from
operating system 154, application programs 155, other program modules 156, and program data
157. Operating system 164, application programs 165, other program modules 166, and program
data 167 are given different numbers here to illustrate that, at a minimum, they are different
copies.

[0032] A user may enter commands and information into the computer 120 through input devices
such as a keyboard 182, a microphone 185, and a pointing device 181, such as a mouse,
trackball or touch pad. Other input devices (not shown) may include a joystick, game pad,
satellite dish, scanner, or the like. These and other input devices are often connected to the
processing unit 140 through a user input interface 180 that is coupled to the system bus, but
may be connected by other interface and bus structures, such as a parallel port, game port or
a universal serial bus (USB). A monitor 184 or other type of display device is also connected
to the system bus 141 via an interface, such as a video interface 185. In addition to the
monitor, computers may also include other peripheral output devices such as speakers 187 and
printer 186, which may be connected through an output peripheral interface 188.

[0033] The computer 120 may operate in a networked environment using logical connections to
one or more remote computers, such as a remote computer 194. The remote computer 194 may be a
personal computer, a hand-held device, a server, a router, a network PC, a peer device or
other common network node, and typically includes many or all of the elements described above
relative to the computer 120. The logical connections depicted in FIG. 1 include a local area
network (LAN) 191 and a wide area network (WAN) 193, but may also include other networks. Such
networking environments are commonplace in offices, enterprise-wide computer networks,
intranets and the Internet.

[0034] When used in a LAN networking environment, the
computer 120 is connected to the LAN 191 through a
network interface or adapter 190. When used in a WAN
networking environment, the computer 120 typically
includes a modem 192 or other means for establishing
communications over the WAN 193, such as the Internet.
The modem 192, which may be internal or external, may be
connected to the system bus 141 via the user input interface
180, or other appropriate mechanism. In a networked
environment, program modules depicted relative to the computer
120, or portions thereof, may be stored in the remote
memory storage device. By way of example, and not
limitation, FIG. 1 illustrates remote application programs 195 as
residing on remote computer 194. It will be appreciated that
the network connections shown are exemplary and other
means of establishing a communications link between the
computers may be used.

[0035] The invention is also operational with numerous
other general purpose or special purpose computing systems,
environments or configurations. Examples of well
known computing systems, environments and/or configurations
that may be suitable for use with the invention
include, but are not limited to, regular telephones (without
any screen), personal computers, server computers, hand-held
or laptop devices, multiprocessor systems,
microprocessor-based systems, set top boxes, programmable
consumer electronics, network PCs, minicomputers, mainframe
computers, distributed computing environments that include
any of the above systems or devices, and the like.

Overview of Machine Translation System

[0036] Prior to discussing the present invention in greater
detail, a brief discussion of a logical form may be helpful. A
full and detailed discussion of logical forms and systems and
methods for generating them can be found in U.S. Pat. No.
5,966,686 to Heidorn et al., issued Oct. 12, 1999 and entitled
METHOD AND SYSTEM FOR COMPUTING SEMANTIC
LOGICAL FORMS FROM SYNTAX TREES. Briefly,
however, logical forms are generated by performing a
morphological analysis on an input text to produce conventional
phrase structure analyses augmented with grammatical
relations. Syntactic analyses undergo further processing in order
to derive logical forms, which are data structures that
describe labeled dependencies among content words in the
textual input. Logical forms can normalize certain syntactical
alternations (e.g., active/passive) and resolve both
intrasentential anaphora and long distance dependencies. As
illustrated herein, for example in FIG. 3A, a logical form
252 can be represented as a graph, which helps intuitively in
understanding the elements of logical forms. However, as
appreciated by those skilled in the art, when stored on a
computer readable medium, the logical forms may not
readily be understood as representing a graph.

[0037] Specifically, a logical relation consists of two
words joined by a directional relation type, such as:
LogicalSubject, LogicalObject,

[0038] IndirectObject;

[0039] LogicalNominative, LogicalComplement,
LogicalAgent;

[0040] CoAgent, Beneficiary;

[0041] Modifier, Attribute, SentenceModifier;

[0042] PrepositionalRelationship;

[0043] Synonym, Equivalence, Apposition;

[0044] Hypernym, Classifier, SubClass;

[0045] Means, Purpose;

[0046] Operator, Modal, Aspect, DegreeModifier,
Intensifier;

[0047] Focus, Topic;

[0048] Duration, Time;

[0049] Location, Property, Material, Manner, Measure,
Color, Size;

[0050] Characteristic, Part;

[0051] Coordinate;

[0052] User, Possessor;

[0053] Source, Goal, Cause, Result; and

[0054] Domain.



[0055] A logical form is a data structure of connected
logical relations representing a single textual input, such as
a sentence or part thereof. The logical form minimally
consists of one logical relation and portrays structural
relationships (i.e., syntactic and semantic relationships,
particularly argument and/or adjunct relations) between
important words in an input string.
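
The relation types and the logical form described above can be pictured as labeled, directed edges between content words. The following Python sketch is illustrative only: the class names `LogicalRelation` and `LogicalForm`, the field names, and the relation labels attached to the example words are assumptions made for this illustration and are not taken from the specification or its figures.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogicalRelation:
    """Two words joined by a directional relation type (hypothetical layout)."""
    head: str        # word the relation starts from
    relation: str    # e.g. "LogicalSubject", "LogicalObject", "Modifier"
    dependent: str   # word the relation points to


@dataclass
class LogicalForm:
    """A collection of connected logical relations for one textual input."""
    relations: list = field(default_factory=list)

    def add(self, head, relation, dependent):
        self.relations.append(LogicalRelation(head, relation, dependent))

    def children(self, word):
        """Return (relation, dependent) pairs that leave the given word."""
        return [(r.relation, r.dependent) for r in self.relations if r.head == word]


# Illustrative content only; the labels are assumed, not copied from a figure.
lf = LogicalForm()
lf.add("click", "LogicalObject", "button")
lf.add("button", "Modifier", "option")
```

Storing relations as a flat list is one of several possible layouts; as the text notes, a stored logical form need not look like a graph even though it can be drawn as one.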

[0056] In one illustrative embodiment, the particular code
that builds logical forms from syntactic analyses is shared
across the various source and target languages that the
machine translation system operates on. The shared
architecture greatly simplifies the task of aligning logical form
segments from different languages since superficially
distinct constructions in two languages frequently collapse onto
similar or identical logical form representations. Examples
of logical forms in different languages are described in
greater detail below with respect to FIGS. 3A-3C.

[0057] FIG. 2 is a block diagram of an architecture of a
machine translation system 200 in accordance with one
embodiment of the present invention. System 200 includes
parsing components 204 and 206, statistical word association
learning component 208, logical form alignment component
210, lexical knowledge base building component
212, bilingual dictionary 214, dictionary merging component
216, transfer mapping database 218 and updated bilingual
dictionary 220. During training and translation run
time, the system 200 utilizes analysis component 222,
matching component 224, transfer component 226 and/or
generation component 228.

[0058] In one illustrative embodiment, a bilingual corpus
is used to train the system. The bilingual corpus includes
aligned translated sentences (e.g., sentences in a source or
target language, such as English, in 1-to-1 correspondence
with their human-created translations in the other of the
source or target language, such as Spanish). During training,
sentences are provided from the aligned bilingual corpus
into system 200 as source sentences 230 (the sentences to be
translated), and as target sentences 232 (the translation of the
source sentences). Parsing components 204 and 206 parse
the sentences from the aligned bilingual corpus to produce
source logical forms 234 and target logical forms 236.

[0059] During parsing, the words in the sentences are
converted to normalized word forms (lemmas) and can be
provided to statistical word association learning component
208. Both single word and multi-word associations are
iteratively hypothesized and scored by learning component
208 until a reliable set of each is obtained. Statistical word
association learning component 208 outputs learned single
word translation pairs 238 as well as multi-word pairs 240.

[0060] The multi-word pairs 240 are provided to a
dictionary merge component 216, which is used to add additional
entries into bilingual dictionary 214 to form updated
bilingual dictionary 220. The new entries are representative of
the multi-word pairs 240.

[0061] The single and multi-word pairs 238, along with
source logical forms 234 and target logical forms 236, are
provided to logical form alignment component 210. Briefly,
component 210 first establishes tentative correspondences
between nodes in the source and target logical forms 234 and
236, respectively. This is done using translation pairs from
a bilingual lexicon (e.g., bilingual dictionary) 214, which can
be augmented with the single and multi-word translation
pairs 238, 240 from statistical word association learning
component 208. After establishing possible correspondences,
alignment component 210 aligns logical form nodes
according to both lexical and structural considerations and
creates word and/or logical form transfer mappings 242.
This aspect will be explained in greater detail below.

[0062] Basically, alignment component 210 draws links
between logical forms using the bilingual dictionary
information 214 and single and multi-word pairs 238, 240. The
transfer mappings are optionally filtered based on a
frequency with which they are found in the source and target
logical forms 234 and 236 and are provided to a lexical
knowledge base building component 212.
[0063] While filtering is optional, in one example, if the
transfer mapping is not seen at least twice in the training
data, it is not used to build transfer mapping database 218,
although any other desired frequency can be used as a filter
as well. It should also be noted that other filtering techniques
can be used as well, other than frequency of appearance. For
example, transfer mappings can be filtered based upon
whether they are formed from complete parses of the input
sentences and based upon whether the logical forms used to
create the transfer mappings are completely aligned.
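
The frequency filter described above can be sketched in a few lines of Python. This is a minimal illustration, assuming each observed transfer mapping is represented as a hashable (source segment, target segment) pair; the function name `filter_transfer_mappings`, the parameter `min_count`, and the sample data are hypothetical.

```python
from collections import Counter


def filter_transfer_mappings(observed, min_count=2):
    """Keep only mappings seen at least `min_count` times in the training data."""
    counts = Counter(observed)
    return {mapping: n for mapping, n in counts.items() if n >= min_count}


# Assumed sample data: each tuple is one mapping observed during alignment.
observed = [
    ("hacer clic", "click"),
    ("hacer clic", "click"),
    ("boton", "button"),
]
kept = filter_transfer_mappings(observed)
```

With the default threshold of two, the mapping observed only once is dropped; passing a different `min_count` corresponds to choosing "any other desired frequency".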
[0064] Component 212 builds transfer mapping database
218, which contains transfer mappings that basically link
words and/or logical forms in one language to words and/or
logical forms in the second language. With transfer mapping
database 218 thus created, system 200 is now configured for
runtime translations.

[0065] During translation run time, a source sentence 250,
to be translated, is provided to analysis component 222.
Analysis component 222 receives source sentence 250 and
creates a source logical form 252 based upon the source
sentence input. An example may be helpful. In the present
example, source sentence 250 is a Spanish sentence "Haga
click en el boton de opcion" which is translated into English
as "Click the option button" or, literally, "Make click in the
button of option".

[0066] FIG. 3A illustrates the source logical form 252
generated for source sentence 250 by analysis component
222. The source logical form 252 is provided to matching
component 224. Matching component 224 attempts to match
the source logical form 252 to logical forms in the transfer
mapping database 218 in order to obtain a linked logical
form 254. Multiple transfer mappings may match portions of
source logical form 252. Matching component 224 searches
for the best set of matching transfer mappings in database
218 that have matching lemmas, parts of speech, and other
feature information. The set of best matches is found based
on a predetermined metric. For example, transfer mappings
having larger (more specific) logical forms may illustratively
be preferred to transfer mappings having smaller
(more general) logical forms. Among mappings having
logical forms of equal size, matching component 224 may
illustratively prefer higher frequency mappings. Mappings
may also match overlapping portions of the source logical
form 252 provided that they do not conflict with each other
in any way. A set of mappings collectively may be illustratively
preferred if they cover more of the input sentence than
alternative sets. Other metrics used in matching the input
logical form to those found in database 218 are discussed in
greater detail below with respect to Table 1.
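
Two of the preferences described above (larger logical forms first, then higher frequency among equal sizes) amount to a two-key sort. The sketch below is a simplified illustration under stated assumptions: `CandidateMapping`, its fields, and the sample values are hypothetical, and the overlap, conflict and coverage criteria mentioned in the text are not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateMapping:
    name: str        # label used only for this illustration
    size: int        # number of nodes in the mapping's source logical form
    frequency: int   # how often the mapping was seen in training


def rank_candidates(candidates):
    """Order candidates by size (larger first), then frequency (higher first)."""
    return sorted(candidates, key=lambda m: (m.size, m.frequency), reverse=True)


candidates = [
    CandidateMapping("general", size=1, frequency=40),
    CandidateMapping("specific-rare", size=3, frequency=2),
    CandidateMapping("specific-common", size=3, frequency=9),
]
ranked = rank_candidates(candidates)
```

A tuple key keeps the two criteria lexicographic, so frequency only breaks ties between mappings of equal size.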



[0067] After a set of matching transfer mappings is found,
matching component 224 creates links on nodes in the
source logical form 252 to copies of the corresponding target
words or logical form segments received by the transfer
mappings, to generate linked logical form 254. FIG. 3B
illustrates an example of linked logical form 254 for the
present example. Links for multi-word mappings are
represented by linking the root nodes (e.g., Hacer and Click) of
the corresponding segments, then linking an asterisk to the
other source nodes participating in the multi-word mapping
(e.g., Usted and Clic). Sublinks between corresponding
individual source and target nodes of such a mapping (not
shown in FIG. 3B) may also illustratively be created for use
during transfer.

[0068] Transfer component 226 receives linked logical
form 254 from matching component 224 and creates a target
logical form 256 that will form the basis of the target
translation. This is done by performing a top down traversal
of the linked logical form 254 in which the target logical
form segments pointed to by links on the source logical form
252 nodes are combined. When combining together logical
form segments for possibly complex multi-word mappings,
the sublinks set by matching component 224 between
individual nodes are used to determine correct attachment points
for modifiers, etc. Default attachment points are used if
needed.

[0069] In cases where no applicable transfer mappings are
found, the nodes in source logical form 252 and their
relations are simply copied into the target logical form 256.
Default single word translations may be found in
transfer mapping database 218 for these nodes and inserted
in target logical form 256. However, if none are found,
translations can illustratively be obtained from updated
bilingual dictionary 220, which was used during alignment.
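
The fallback order described above can be sketched as a chain of lookups. This is an illustration only: the function name `translate_node` and the plain dictionaries standing in for transfer mapping database 218 and updated bilingual dictionary 220 are assumptions, and returning the source lemma unchanged when both lookups fail is also an assumption (the text says only that unmatched nodes are copied into the target logical form).

```python
def translate_node(lemma, default_translations, bilingual_dictionary):
    """Translate one unmatched source node using the described fallback order."""
    # 1. Default single-word translation (stand-in for database 218).
    # 2. Otherwise the bilingual dictionary (stand-in for dictionary 220).
    for table in (default_translations, bilingual_dictionary):
        if lemma in table:
            return table[lemma]
    # Assumed behavior: leave the copied node untranslated.
    return lemma


# Hypothetical sample tables.
defaults = {"boton": "button"}
dictionary = {"boton": "knob", "opcion": "option"}
```

For example, `translate_node("boton", defaults, dictionary)` takes the default translation even though the dictionary also has an entry, because the default table is consulted first.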

[0070] FIG. 3C illustrates a target logical form 256 for the
present example. It can be seen that the logical form
segments from "click" to "button" and from "button" to
"option" were stitched together from linked logical form 254
to obtain target logical form 256.

[0071] Generation component 228 is illustratively a
rule-based, application-independent generation component that
maps from target logical form 256 to the target string (or
output target sentence) 258. Generation component 228 may
illustratively have no information regarding the source
language of the input logical forms, and works exclusively with
information passed to it by transfer component 226.
Generation component 228 also illustratively uses this information
in conjunction with a monolingual (e.g., for the target
language) dictionary to produce target sentence 258. One
generic generation component 228 is thus sufficient for each
language.

[0072] It can thus be seen that the present system parses
information from various languages into a shared, common,
logical form so that logical forms can be matched among
different languages. The system can also utilize simple
filtering techniques in building the transfer mapping database
to handle noisy data input. Therefore, the present
system can be automatically trained using a very large
number of sentence pairs. In one illustrative embodiment,
the number of sentence pairs is in excess of 10,000. In
another illustrative embodiment, the number of sentence
pairs is greater than 50,000 to 100,000, and may be in excess
of 180,000, 200,000, 350,000 or even in excess of 500,000
or 600,000 sentence pairs. Also, the number of sentence
pairs can vary for different languages, and need not be
limited to these numbers.

Logical Form Alignment

[0073] FIG. 4 illustrates a method 300 of associating
logical forms of at least sentence fragments from two
different languages, wherein the logical forms comprise
nodes organized in a parent/child structure. Method 300
includes associating nodes of the logical forms to form
tentative correspondences as indicated at block 302 and
aligning nodes of the logical forms by eliminating at least
one of the tentative correspondences and/or structural
considerations as indicated at block 304.

[0074] As indicated above with respect to FIG. 2,
alignment component 210 accesses bilingual dictionary 214 in
order to form tentative correspondences, typically lexical
correspondences, between the logical forms. Bilingual
dictionary 214 can be created by merging data from multiple
sources, and can also use inverted target-to-source dictionary
entries to improve coverage. As used herein, bilingual
dictionary 214 also represents any other type of resource that
can provide correspondences between words. Bilingual
dictionary 214 can also be augmented with translation
correspondences acquired using statistical techniques.

[0075] In FIG. 2, the statistical techniques are performed
by component 208. Although the output from component
208 can be used by alignment component 210, it is not
necessary for operation of alignment component 210.
However, one embodiment of component 208 will be described
here, briefly, for the sake of completeness.

[0076] Component 208 receives a parallel, bilingual training
corpus that is parsed into its content words. Word
association scores are computed for each pair of content words
consisting of a word of language L1 that occurs in a sentence aligned
in the bilingual corpus to a sentence of language L2 in which
the other word occurs. A pair of words is considered
"linked" in a pair of aligned sentences if one of the words is
the most highly associated, of all the words in its sentence,
with the other word. The occurrence of compounds is
hypothesized in the training data by identifying maximal,
connected sets of linked words in each pair of aligned
sentences in the processed and scored training data.
Whenever one of these maximal, connected sets contains more
than one word in either or both of the languages, the subset
of the words in that language is hypothesized as a compound.
The original input text is rewritten, replacing the
hypothesized compounds by single, fused tokens. The
association scores are then recomputed for the compounds
(which have been replaced by fused tokens) and any remaining
individual words in the input text. The association scores
are again recomputed, except that this time, co-occurrences
are taken into account in computing the association scores
only where there is no equally strong or stronger other
association in a particular pair of aligned sentences in the
training corpus.

[0077] Translation pairs can be identified as those word
pairs or token pairs that have association scores above a
threshold, after the final computation of association scores.
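
The notion of "linked" words used above can be sketched as follows. This is one possible reading of the definition, offered as an illustration rather than as the learning component's actual procedure: the function name `linked_pairs`, the score table, and the rule of ignoring zero scores are assumptions, and how the association scores themselves are computed is not modeled.

```python
def linked_pairs(src_words, tgt_words, score):
    """Return word pairs that are 'linked' in one aligned sentence pair.

    A pair is treated as linked if either word is the most highly
    associated word of its own sentence with respect to the other word.
    """
    if not src_words or not tgt_words:
        return set()
    links = set()
    for t in tgt_words:
        best = max(src_words, key=lambda s: score.get((s, t), 0.0))
        if score.get((best, t), 0.0) > 0:
            links.add((best, t))
    for s in src_words:
        best = max(tgt_words, key=lambda t: score.get((s, t), 0.0))
        if score.get((s, best), 0.0) > 0:
            links.add((s, best))
    return links


# Hypothetical association scores for one aligned sentence pair.
scores = {("hacer", "click"): 0.4, ("clic", "click"): 0.9}
links = linked_pairs(["hacer", "clic"], ["click"], scores)
```

In this toy input both source words end up linked to "click", so the connected set contains two source words; under the description above, that subset would be hypothesized as a compound.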

[0078] Similarly, component 208 can also assist in
identifying translations of "captoids", by which we mean titles,
or other special phrases, all of whose words are capitalized.
(Finding translations of captoids presents a special problem
in languages like French or Spanish, in which convention
dictates that only the first word of such an item is capitalized,
so that the extent of the captoid translation is difficult to
determine.) In that embodiment, compounds are first
identified in a source language (such as English). This can be
done by finding strings of text where the first word begins
with a capital letter, and later tokens in the contiguous string
do not begin with a lowercase letter. Next, compounds are
hypothesized in the target text by finding words that start
with a capital letter and flagging this as the possible start of
a corresponding compound. The target text is then scanned
from left to right flagging subsequent words that are most
strongly related to words in the identified compound in the
source text, while allowing up to a predetermined number
(e.g., 2) contiguous non-most highly related words, so long
as they are followed by a most highly related word.

[0079] The left to right scan can be continued until more
than the predetermined number (e.g., more than 2) contiguous
words are found that are not most highly related to
words in the identified compound in the source text, or until
no more most highly related words are present in the target
text, or until punctuation is reached.
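
The left-to-right scan and its three stopping conditions can be sketched as below. This is a minimal illustration under stated assumptions: the function name `scan_captoid`, the `related` set (standing in for "most highly related" target words, however those are determined), and the pre-tokenized input are all hypothetical.

```python
import string


def is_punctuation(token):
    """True if the token consists only of ASCII punctuation characters."""
    return bool(token) and all(ch in string.punctuation for ch in token)


def scan_captoid(tokens, start, related, max_gap=2):
    """Scan right from tokens[start] and return the hypothesized compound.

    Up to `max_gap` contiguous unrelated words are tolerated, but only if a
    related word follows them; the scan stops at punctuation, when the gap
    is exceeded, or when the tokens run out.
    """
    last_related = start
    gap = 0
    for i in range(start + 1, len(tokens)):
        if is_punctuation(tokens[i]):
            break
        if tokens[i].lower() in related:
            last_related = i
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                break
    return tokens[start:last_related + 1]


tokens = ["Informacion", "del", "hipervinculo", ",", "haga", "clic"]
span = scan_captoid(tokens, 0, related={"informacion", "hipervinculo"})
```

Tracking the index of the last related word, rather than the current position, keeps trailing unrelated words out of the result when the scan stops.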

[0080] While the above description has been provided for
component 208, it is to be noted that component 208 is
optional.

[0081] Referring again to method 300 in FIG. 4, generally,
forming tentative correspondences in step 302 is
aggressively pursued with the purpose of attempting to maximize
the number of tentative correspondences formed between
the logical forms. Accuracy of the tentative correspondences
is not the most important criteria in step 302 because step
304 will further analyze the tentative correspondences and
remove those that are determined to be incorrect.

[0082] Bilingual dictionary 214 represents direct translations
used for forming tentative correspondences. However,
in order to form additional tentative correspondences,
derivational morphology can also be used. For example,
translations of morphological bases and derivations, and base and
derived forms of translations, can also be used to form
tentative correspondences in step 302. Likewise, tentative
correspondences can also be formed between nodes of the
logical forms wherein one of the nodes comprises more
lexical elements or words than the other node. For instance,
as is common, one of the nodes can comprise a single word
in one of the languages, while the other node comprises at
least two words in the other language. Closely related
languages such as English, Spanish, etc. also have word
similarity (cognates) that can be used with fuzzy logic to
ascertain associations. These associations can then be used
to form tentative correspondences.
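
As a rough illustration of using word similarity between cognates, the sketch below scores string similarity with Python's standard `difflib.SequenceMatcher`. This is a swapped-in stand-in, not the fuzzy logic technique the text refers to, which is not specified; the function names and the 0.8 threshold are assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def cognate_score(a, b):
    """Similarity in [0, 1] between two words, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def tentative_cognates(src_words, tgt_words, threshold=0.8):
    """Propose (source, target) pairs whose spelling is similar enough."""
    return [
        (s, t)
        for s in src_words
        for t in tgt_words
        if cognate_score(s, t) >= threshold
    ]


pairs = tentative_cognates(["informacion", "direccion"], ["information", "address"])
```

Because step 302 favors recall over accuracy, a fairly permissive threshold may be reasonable here; incorrect pairs are expected to be removed during alignment.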

[0083] At this point, it may be helpful to consider an
example of logical forms to be aligned. Referring to FIG.
5A, logical form 320 was generated for the sentence "En
Informacion del hipervinculo, haga clic en la direccion del
hipervinculo", while logical form 322 was generated for the
English translation as "Under Hyperlink Information, click
the hyperlink address".

[0084] FIG. 5A further illustrates each of the tentative
correspondences 323 identified in step 302. As an example
of the aggressive pursuit of tentative correspondences in step
302, in this example, each of the occurrences of
"Hipervinculo" includes two different tentative correspondences
with "Hyperlink_Information" and "hyperlink" in the
English logical form 322.

[0085] Referring now to step 304, the logical forms are
aligned, which can include eliminating one or more of the
tentative correspondences formed in step 302, and/or which
can be done as a function of structural considerations of the
logical forms. In one embodiment, step 304 includes aligning
nodes of the logical forms as a function of a set of rules.
In a further embodiment, each of the rules of the set of rules
is applied to the logical forms in a selected order. In
particular, the rules are ordered to create the most
unambiguous alignments ("best alignments") first, and then, if
necessary, to disambiguate subsequent node alignments. It is
important to note that the order that the rules are applied in
is not based upon the structure of the logical forms, i.e.,
top-down processing or bottom-up processing, but rather, to
begin with the most linguistically meaningful alignments,
wherever they appear in the logical form. As such, this set
of rules can be considered to be applied to the nodes of each
of the logical forms non-linearly as opposed to linearly
based upon the structure of the logical forms. Generally, the
rules are intended to be language-neutral in order that they
can be universally applied to any language.

[0086] FIG. 6 generally illustrates application of the set of
rules to the logical forms as method 328. At step 330, each
of the nodes of the logical forms is considered to be
"unaligned" as opposed to "aligned". The set of rules is
applied to the unaligned nodes irrespective of structure at
step 332 to form aligned nodes. Therefore, it is desirable to
distinguish between unaligned nodes and aligned nodes.
One technique includes assigning all of the nodes initially to
the set of unaligned nodes, and removing nodes when they
are aligned. The use of sets, whether actively formed in
different locations of a computer readable medium or
virtually formed through the use of Boolean tags associated
with the nodes, merely provides a convenient way in which
to identify unaligned nodes and aligned nodes.

[0087] At step 332, the set of rules is applied to each of the
unaligned nodes. FIG. 7 schematically illustrates aspects of
step 332 that can be implemented to apply the set of rules.
In one embodiment as discussed above, the rules are applied
in a specified order. Herein "N" is a counter that is used to
indicate which of the rules is applied. In the first iteration,
step 334 applies the first rule to each of the unaligned nodes.
If a rule fails to be applicable to any of the unaligned nodes,
another rule from the set (and in one embodiment, the next
successive rule indicative of a linguistically meaningful
alignment) is then applied as indicated at steps 336 and 338.

[0088] If all the rules of the set of rules have been applied
to all the nodes at step 340, the alignment procedure is
finished. It should be noted that under some situations, not
all of the nodes will be aligned.

[0089] If a rule can be applied to a set of nodes of the
logical forms, the nodes are identified as being aligned and
removed from the set of unaligned nodes, and application of
the rules continues. However, in one embodiment, it is
advantageous to begin again with the rules once some rules
have been applied to obtain a more linguistically meaningful
alignment. Therefore, it can be desirable to again apply rules
that have previously been applied. In this manner, in one
embodiment, each of the rules of the set of rules is applied
again starting with, for example, the first rule as indicated at
step 342.
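
The control flow described for FIGS. 6 and 7 (try rules in order, remove newly aligned nodes from the unaligned sets, and restart from the first rule after any success) can be sketched as follows. This is a simplified illustration, not the patented procedure: the function `apply_rules`, the rule signature, and the two toy rules are assumptions, and only one-to-one alignments are modeled.

```python
def apply_rules(src_nodes, tgt_nodes, rules):
    """Apply ordered rules, restarting at the first rule after each success."""
    unaligned_src, unaligned_tgt = set(src_nodes), set(tgt_nodes)
    aligned = []
    n = 0  # counter indicating which rule is applied
    while n < len(rules):
        progressed = False
        proposals = rules[n](frozenset(unaligned_src), frozenset(unaligned_tgt), tuple(aligned))
        for s, t in proposals:
            if s in unaligned_src and t in unaligned_tgt:
                unaligned_src.remove(s)
                unaligned_tgt.remove(t)
                aligned.append((s, t))
                progressed = True
        n = 0 if progressed else n + 1
    return aligned, unaligned_src, unaligned_tgt


# Toy rules, assumed for illustration only.
DICTIONARY = {"clic": "click", "direccion": "address"}


def rule_dictionary(us, ut, aligned):
    return [(s, DICTIONARY[s]) for s in sorted(us) if DICTIONARY.get(s) in ut]


def rule_last_pair(us, ut, aligned):
    if aligned and len(us) == 1 and len(ut) == 1:
        return [(next(iter(us)), next(iter(ut)))]
    return []


result, left_src, left_tgt = apply_rules(
    ["clic", "direccion", "hipervinculo"],
    ["click", "address", "hyperlink"],
    [rule_dictionary, rule_last_pair],
)
```

The loop always terminates: every successful pass shrinks the unaligned sets, and every unsuccessful pass advances the counter. Nodes may remain unaligned at the end, which matches the note above that not all nodes will necessarily be aligned.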

[0090] The following is an exemplary set of rules for
aligning nodes of the logical forms. The set of rules
presented herein is ordered based on the strongest to weakest
linguistically meaningful alignments of the nodes. As
appreciated by those skilled in the art, reordering at least some
of the rules presented herein may not significantly alter the
quality of alignments of the logical forms.

[0091] 1. If a bi-directionally unique translation exists
between a node or set of nodes in one logical form and a
node or set of nodes in the other logical form, the two nodes
or sets of nodes are aligned to each other. A bi-directionally
unique translation exists if a node or a set of nodes of one
logical form has a tentative correspondence with a node or
a set of nodes in the other logical form, such that every node
in the first set of nodes has a tentative correspondence with
every node in the second set of nodes, and no other
correspondences, and every node in the second set of nodes has
a tentative correspondence with every node in the first set of
nodes, and no other correspondences.

[0092] 2. A pair of parent nodes, one from each logical
form, having a tentative correspondence to each other, are
aligned with each other if each child node of each respective
parent node is already aligned to a child of the other parent
node.

[0093] 3. A pair of child nodes, one from each logical
form, are aligned with each other if a tentative
correspondence exists between them and if a parent node of each
respective child node is already aligned to a corresponding
parent node of the other child.

[0094] 4. A pair of nodes, one from each logical form, are
aligned to each other if respective parent nodes of the nodes
under consideration are aligned with each other and
respective child nodes are also aligned with each other.

[0095] 5. A node that is a verb and an associated child node
that is not a verb from one logical form are aligned to a
second node that is a verb of the other logical form if the
associated child node is already aligned with the second verb
node, and either the second verb node has no aligned parent
nodes or the first verb node and the second verb node have
child nodes aligned with each other.

[0096] 6. A pair of nodes, one from each logical form,
comprising the same part-of-speech, are aligned to each
other, if there are no unaligned sibling nodes, and respective
parent nodes are aligned, and linguistic relationships
between the set of nodes under consideration and their
respective parent nodes are the same.

[0097] 7. A pair of nodes, one from each logical form,
comprising the same part-of-speech, are aligned to each
other if respective child nodes are aligned with each other
and the linguistic relationship between the set of nodes
under consideration and their respective child nodes are the
same.

[0098] 8. If an unaligned node of one of the logical forms
having immediate neighbor nodes comprising respective
parent nodes, if any, all aligned, and respective child nodes,
if any, all aligned, and if exactly one of the immediate nodes
is a non-compound word aligned to a node of the other
logical form comprising a compound word, then align the
unaligned node with the node comprising the compound
word. Note that the immediate neighbor nodes herein
comprise adjacent parent and child nodes; however, the existence
of parent and child nodes is not required, but if they are
present, they must be aligned.

[0099] 9. A pair of nodes, one from each logical form,
comprising pronouns, are aligned to each other if respective
parent nodes are aligned with each other and neither of the
nodes under consideration have unaligned siblings.

[0100] 10. A pair of nodes, one from each logical form,
comprising nouns are aligned to each other if respective
parent nodes comprising nouns are aligned with each other
and neither of the nodes under consideration have unaligned
sibling nodes, and wherein a linguistic relationship between
each of the nodes under consideration and their respective
parent nodes comprises either a modifier relationship or a
prepositional relationship.

[0101] 11. A first verb node of one logical form is aligned
to a second verb node of the other logical form if the first
verb node has no tentative correspondences and has a single
associated child verb node that is already aligned with the
second verb node.

[0102] 12. A first verb node and a single, respective parent
node of one logical form is aligned to a second verb node of
the other logical form if the first verb node has no tentative
correspondences and has a single parent verb node that is
already aligned with the second verb node, where the single
parent verb node has no unaligned verb child nodes besides
the first verb node, and the second verb node has no
unaligned verb child nodes.

[0103] 13. A first node comprising a pronoun of one
logical form is aligned to a second node of the other logical
form if a parent node of the first node is aligned with the
second node and the second node has no unaligned child
nodes.

[0104] 14. A first verb node and a respective parent verb
of one logical form is aligned to a second verb node of the
other logical form if the first verb node has no tentative
correspondences and the parent verb node is aligned with the
second verb node and where the relationship between the
first verb and the parent verb node comprises a modal
relationship.
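
The bi-directional uniqueness test of rule 1 can be expressed as a check over the set of tentative correspondences. The sketch below is an illustration under stated assumptions: the function name `bidirectionally_unique`, the representation of correspondences as (source, target) pairs, and the sample pairs (loosely modeled on the hyperlink example, with repeated words collapsed into one node) are hypothetical.

```python
def bidirectionally_unique(src_set, tgt_set, correspondences):
    """Check rule 1: each side corresponds to all of, and only, the other side."""
    src_set, tgt_set = set(src_set), set(tgt_set)
    if not src_set or not tgt_set:
        return False
    for s in src_set:
        targets = {t for (s2, t) in correspondences if s2 == s}
        if targets != tgt_set:
            return False
    for t in tgt_set:
        sources = {s for (s, t2) in correspondences if t2 == t}
        if sources != src_set:
            return False
    return True


# Assumed tentative correspondences for illustration.
corr = {
    ("direccion", "address"),
    ("hipervinculo", "hyperlink"),
    ("hipervinculo", "Hyperlink_Information"),
    ("informacion", "Hyperlink_Information"),
}
unique_ok = bidirectionally_unique({"direccion"}, {"address"}, corr)
unique_bad = bidirectionally_unique({"informacion"}, {"Hyperlink_Information"}, corr)
```

In the sample data the second check fails because the target node also corresponds to another source node, so the translation is unique in one direction only.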

[0105] Some general classifications of the rules provided
above include that one rule (rule 1) is primarily based upon
the correspondences established in step 302, and in the
embodiment illustrated, it is considered to be the strongest
meaningful alignment since no ambiguity is present. Other
rules such as rules 2, 3, 11, 12 and 14 are based on a
combination of, or a lack of, tentative correspondences and
the structure of the nodes under consideration and
previously aligned nodes. The remaining rules rely solely on
relationships between nodes under consideration and
previously aligned nodes. Other general classifications that can be
drawn include that the rules pertain to verbs, nouns and
pronouns.

[0106] Referring back to the logical forms and tentative
correspondences of FIG. 5A, the rules set out above can be
applied according to the method 300 of FIG. 4 in order to
align the nodes as illustrated in FIG. 5B. In this example, the
two instances of "hipervinculo" have two ambiguous
tentative correspondences, and while the correspondence from
"Informacion" to "Hyperlink_Information" is unique, the
reverse is not. It should also be noted that neither the
monolingual nor the bilingual lexicons or dictionaries have
been customized for this domain. For example, there is no
entry in the lexicon for "Hyperlink_Information". This unit
has been assembled by general rules that link sequences of
capitalized words. Tentative lexical correspondences
established for this element are based on translations found
for its individual components.

[0107] Applying the alignment rules as described above,
the alignment mappings created by the rules are illustrated
in FIG. 5B as dotted lines 344 and are obtained as follows.

[0108] Iterating through the rules again, rule 1 applies in
three places, creating alignment mappings between
"direccion" and "address", "usted" and "you", and "clic" and
"click". These are the initial "best" alignments that provide
the anchors from which the method will work outwards to
align the rest of the structure.

[0109] Rule 2 does not apply to any nodes, but Rule 3
applies next to align the instance of "hipervinculo" that is
the child of "direccion" to "hyperlink" which is the child of
"address". The alignment method thus leveraged a
previously created alignment ("direccion" to "address") and the
structure of the logical form to resolve the ambiguity present
at the lexical level.

[0110] Rule 1 applies (where previously it did not) to
create a many-to-one mapping between "Informacion" and
"hipervinculo" to "Hyperlink_Information". The uniqueness
condition in this rule is now met because the ambiguous
alternative was cleared away by the prior application of Rule
3.

[0111] Rule 4 does not apply, but rule 5 applies to rollup
"hacer" with its object "clic", since the latter is already
aligned to a verb. This produces the many-to-one alignment
of "hacer" and "clic" to "click".

[0112] Referring back to FIG. 7, alignment of the logical
forms is completed when the rules are no longer applicable
to any of the nodes. At this point, transfer mappings can be
obtained by component 212.
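
By way of illustration only, the iteration walked through above (apply a rule, return to the earlier rules, and stop when no rule is applicable to any node) can be sketched as a fixed-point loop. The `Rule` signature, the function name `align_until_stable` and the two toy rules are assumptions made for this sketch; they are stand-ins and do not reproduce the rules set out above.

```python
from typing import Callable, Iterable, Set, Tuple

Alignment = Tuple[str, str]  # (source lemma, target lemma)
# Hypothetical rule signature: given the alignments made so far,
# return every alignment the rule would propose.
Rule = Callable[[Set[Alignment]], Set[Alignment]]


def align_until_stable(rules: Iterable[Rule]) -> Set[Alignment]:
    """Apply the rules repeatedly until none of them adds an alignment."""
    rules = list(rules)
    aligned: Set[Alignment] = set()
    changed = True
    while changed:
        changed = False
        for rule in rules:
            new = rule(aligned) - aligned
            if new:
                aligned |= new
                changed = True
                break  # go back to the first rule after any change
    return aligned


def toy_rule_1(aligned: Set[Alignment]) -> Set[Alignment]:
    # Stand-in for rule 1: unambiguous lexical correspondences.
    return {("direccion", "address"), ("usted", "you"), ("clic", "click")}


def toy_rule_3(aligned: Set[Alignment]) -> Set[Alignment]:
    # Stand-in for rule 3: fires only once the parent nodes are aligned.
    if ("direccion", "address") in aligned:
        return {("hipervinculo", "hyperlink")}
    return set()


result = align_until_stable([toy_rule_1, toy_rule_3])
```

Restarting from the first rule after each change is one reading of the walkthrough, in which rule 1 applies a second time after rule 3 has removed an ambiguity; other orderings are possible.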

[0113] FIG. 8 illustrates some of the transfer mappings
obtainable from the example of aligned logical forms in
FIG. 5B (other than transfer mapping 353 which is included
as an example of a conflicting transfer mapping discussed in
the next section). Generally, a transfer mapping or simply
"mapping" is indicative of associating a word or logical
form of a first language with a corresponding word or logical
form of a second language. The mappings can be stored on
any computer readable medium as explicit pointers linking
the words or logical forms of the first language with the
corresponding words or logical forms of the second
language. Likewise, the mappings can be stored with the words
or logical forms rather than in a separate database. As
appreciated by those skilled in the art, other techniques can
be used to associate words or logical forms of the first
language with words or logical forms of the second
language, and it is this association that constitutes the
mappings regardless of the specific techniques used in order to
record this information.



[0114] Each mapping created during the alignment
procedure can be a base structure upon which further mappings
with additional context are also created. In particular,
information can be stored on a computer readable medium to
translate text from a first language to a second language,
where the information comprises a plurality of mappings.
Each mapping is indicative of associating a word or logical
form of the first language with a word or logical form of the
second language. However, in addition, at least some of the
mappings corresponding to logical forms of the first
language have varying context with some common elements.
Likewise, at least some of the logical forms of the second
language corresponding to the logical forms of the first
language may also have varying context with some common
elements. In other words, at least some of the core mappings
obtained from the alignment procedure are used to create
other, competing mappings having varying types and
amounts of local context.

[0115] Referring to FIG. 8, mappings 350, 352, and 354
illustrate how an element of a logical form can vary.
Mapping 350 comprises the base or core mapping on which
further mappings are created. Mapping 352 expands the core
mapping 350 to include an additional linguistic element,
herein the direct object of the word "click", while the
mapping 354 is expanded from the core mapping 350 such
that the additional element comprises an under-specified
node ("*") indicating a part of speech but no specific lemma.
By comparing the mappings 350, 352 and 354, as well as
mappings 356 and 358, it can be seen that the logical forms
of the first language have common elements (parts of speech
and/or lemmas), while the logical forms of the second
language also have common elements.

[0116] By storing mappings indicative of logical forms
with overlapping context, during translation run time,
fluency and general applicability of the mappings for
translating between the languages is maintained. In particular, by
having mappings associating both words and smaller logical
forms of the languages, translation from the first language to
the second language is possible if the text to be translated
was not seen in the training data. However, to the extent that
the larger context was present in the training data, this is also
reflected in the mappings such that when a mapping of larger
context is applicable, a more fluent translation between the
first language and the second language can be obtained.

[0117] Generally, linguistic constructs are used to provide
boundaries for expanding the core mappings to include
additional context. For example, a mapping for an adjective
can be expanded to include the noun it modifies. Likewise,
a mapping for a verb can be expanded to include the object
as context. In another example, mappings for noun
collocations are provided individually as well as a whole. As further
illustrated in FIG. 8, some of the mappings can include
under-specified nodes ("*"), wherein the part of speech is
indicated but no specific lemma is provided. These types of
mappings increase the overall applicability of the mappings
for translating from the first language to the second
language, but also include context to enhance fluency of the
translation obtained.

[0118] In general, mappings that can be created may have
any number of wild-card or underspecified nodes, which
may be underspecified in a number of different ways. For
example, they may or may not specify a part-of-speech, and
they may specify certain syntactic or semantic features. For
example, a pattern may have a wild-card node with the
feature "ProperName" or "Location" marked, indicating that
the pattern only applies when that node is matched to an
input node that has the same feature. These wild-cards allow
the system to hypothesize generalized mappings from
specific data.
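
As a non-limiting sketch of how a wild-card node carrying a marked feature might be tested against an input node, consider the following. The `Node` class, its field names and the function `node_matches` are hypothetical and were chosen for this illustration only; the lemmas in the usage lines are invented examples.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Node:
    lemma: Optional[str]            # None marks a wild-card ("*") node
    pos: Optional[str] = None       # part of speech; None if unspecified
    features: FrozenSet[str] = field(default_factory=frozenset)


def node_matches(pattern: Node, candidate: Node) -> bool:
    """True if the input node satisfies every constraint in the pattern."""
    if pattern.lemma is not None and pattern.lemma != candidate.lemma:
        return False
    if pattern.pos is not None and pattern.pos != candidate.pos:
        return False
    # Every feature marked on the pattern must be present on the input.
    return pattern.features <= candidate.features


wildcard = Node(None, "Noun", frozenset({"ProperName"}))
proper = Node("Redmond", "Noun", frozenset({"ProperName", "Location"}))
common = Node("oficina", "Noun")

matches_proper = node_matches(wildcard, proper)
matches_common = node_matches(wildcard, common)
```

Here the wild-card pattern applies to the proper-name input node but not to the common noun, because the latter lacks the marked feature.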

Matching the Transfer Mappings During Run Time

[0119] In addition to the information pertaining to the
mappings between the words or logical forms of the first
language and the second language, additional information
can also be stored or used during run time translation. The
additional information can be used to choose an appropriate
set of mappings and resolve conflicts as to which mappings
to use, i.e. (referring to FIG. 2) when a source logical form
252 (or part thereof) generated for a source sentence 250
matches more than one source side of the transfer mappings
in the transfer mapping database 218.

[0120] For example, when the source logical form
matches the source side of multiple transfer mappings in
database 218, a subset of these matching transfer mappings
is illustratively selected such that all transfer mappings in
the subset are compatible with one another (i.e., they are not
conflicting) and based on a metric that is a function of how
much of the input sentence the transfer mappings in the
subset collectively cover, as well as other measures related
to individual transfer mappings. Some such measures are set
out in Table 1.



TABLE 1

1. Size of transfer mapping matched.
2. The frequency with which the transfer mapping was seen
   in the training data.
3. The frequency with which the transfer mapping was
   generated from fully aligned logical forms.
4. The frequency with which the transfer mapping was
   generated from partially aligned logical forms.
5. The frequency with which the transfer mapping was
   generated from logical forms that resulted from a fitted
   parse.
6. An alignment score assigned to the transfer mapping by
   the alignment procedure.



[0121] Once the subset of matching transfer mappings is
selected, the transfer mappings in the subset are combined
into a transfer logical form from which the output text is
generated.

[0122] It should be noted that the subset of matching
transfer mappings can contain overlapping transfer
mappings, so long as they are compatible. For example, the
following logical form can be generated for the Spanish
sentence "Haga clic en el direccion de la oficina" which can
be translated as "Click the office address":

[0123] Hacer - Dobj - click

[0124] - en - direccion

[0125] - de - oficina

[0126] This logical form can potentially be matched to all
of the transfer mappings 350, 352 and 354 because each
transfer mapping contains this logical form. These transfer
mappings overlap, but do not conflict (because all can be
translated as the same thing). Therefore, all may be included
in the subset of matching transfer mappings and the transfer
logical form can be generated from them. However, if it is
desired to choose among them, the best choice may be
transfer mapping 352 because it is the largest. Others could
be chosen for a variety of other reasons as well.

[0127] An example of conflicting, matching transfer
mappings is shown as transfer mapping 353 which conflicts with
transfer mapping 352. Therefore, for example, the logical
form:

[0128] Hacer - Dobj - click
[0129] - en - direccion
[0130] would match all of transfer mappings 350, 352, 353
and 354. However, since transfer mappings 352 and 353
conflict (because they are translated differently) both cannot
be part of the selected subset of matching transfer mappings.
Thus, one is selected based on a predetermined metric. For
example, subset 350, 352 and 354 can be compared against
subset 350, 353 and 354 to see which covers the most nodes
in the input logical form, collectively. Also, both transfer
mappings 352 and 353 are the same size (on the source side).
Therefore, other information can be used to distinguish
between them in selecting the subset of matching transfer
mappings.
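
Purely as an illustrative sketch, the comparison of candidate subsets described above can be expressed as ranking each subset by the number of input nodes it collectively covers, with a secondary measure used when coverage is equal. The node sets and frequency counts below are invented for the example, and the use of summed training frequency as the tie-breaker is an assumption; the text leaves the particular metric open.

```python
from typing import Dict, Sequence, Set, Tuple

Subset = Tuple[int, ...]  # transfer mapping reference numerals


def coverage(subset: Subset, matched: Dict[int, Set[str]]) -> int:
    """Number of distinct input nodes covered by the subset collectively."""
    covered: Set[str] = set()
    for mapping_id in subset:
        covered |= matched[mapping_id]
    return len(covered)


def pick_subset(candidates: Sequence[Subset],
                matched: Dict[int, Set[str]],
                frequency: Dict[int, int]) -> Subset:
    """Prefer higher coverage; break ties by summed frequency (assumed)."""
    return max(
        candidates,
        key=lambda s: (coverage(s, matched), sum(frequency[m] for m in s)),
    )


# Hypothetical input nodes matched by each mapping.
matched = {
    350: {"hacer", "clic"},
    352: {"hacer", "clic", "direccion"},
    353: {"hacer", "clic", "direccion"},
    354: {"hacer", "clic", "direccion"},
}
# Hypothetical training-data frequencies.
frequency = {350: 20, 352: 12, 353: 3, 354: 8}

chosen = pick_subset([(350, 352, 354), (350, 353, 354)], matched, frequency)
```

With these invented figures both subsets cover the same three nodes, so the subset containing mapping 352 is chosen only because of the tie-breaking measure.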

[0131] As another example of conflicting transfer
mappings, assume that a number of sentences processed during
training included the phrase "click <something>" that
aligned to the Spanish "hacer clic en <something>". In other
sentences assume the phrase "click <something>" aligned
to "elegir <something>" (literally "select <something>").

[0132] This yields the following mappings (note these
examples are English mapped to Spanish whereas previous
examples have been Spanish mapped to English):



[Mapping diagram: "click <something>" mapped to "hacer clic en <something>"]

[0133] for the first case, and

[Mapping diagram: "click <something>" mapped to "elegir <something>"]

[0134] in the second case.

[0135] In the proper contexts, translating "click" to
"select" may be a legitimate variation. However it does
present a problem in some cases. For example, notice that
the source side of both transfers is identical, so at runtime,
if the input logical form matches that source side, we are left
with having to choose between the two different target sides,
i.e. it must be decided whether to translate the input as
"hacer clic ..." or as "elegir ..."? In the absence of further
context (which would likely have manifested itself by
causing differing source sides of the transfers) we choose
between them based on various frequency and scoring
metrics.






[0136] Another type of conflict should also be mentioned.
At runtime, for a given input sentence, there may be multiple
matching transfer mappings that match different parts of the
input sentence. Several of them can be chosen as the selected
subset so that they can be stitched together to produce a
transfer LF that covers the entire input. However, some of
these matches that are stitched together will overlap one
another, and some will not. Of the ones that overlap, we can
only use those that are "compatible" with one another. As
discussed above, by "overlap" we mean two mappings
where at least one node of the input sentence is matched by
both mappings. By compatible, we mean the following:
matches are always compatible if they do not overlap, and
matches that do overlap are compatible if the target sides
that correspond to the node(s) at which they overlap are
identical.

[0137] For example, if an input sentence is "cambiar
configuracion de seguridad" (translated as "change the
security setting") and it matches a transfer mapping as follows:



cambiar - Tobj - configuracion => change - Tobj - setting



[0138] and we match another mapping of:

[0139] then the two matches do overlap (on
"configuracion"), but they are compatible, because they also both
translate "configuracion" to "setting". Therefore, we can
combine them to produce a transfer LF (or target LF) of:

[0140] change

[0141] Tobj setting

[0142] Mod security

[0143] However suppose there was also a third mapping,
of:

configuracion => value

[0144] then this mapping, which does overlap the previous
two at "configuracion", is not compatible, because it would
translate "configuracion" to "value", not "setting".
Therefore, this mapping cannot be merged with the previous two,
so either this transfer mapping, or the previous two, must be
chosen, but not both at the same time.
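
The overlap and compatibility tests described in the preceding paragraphs can be sketched as follows, purely by way of illustration. Each match is simplified here to a dictionary from input node to the target word it would produce; this representation, the function names, and the contents of the second match (which is not legible in the text above) are assumptions made for the sketch.

```python
from typing import Dict, Iterable

Match = Dict[str, str]  # input node -> target word (simplified)


def overlap(a: Match, b: Match) -> set:
    """Input nodes matched by both mappings."""
    return a.keys() & b.keys()


def compatible(a: Match, b: Match) -> bool:
    """Non-overlapping matches are always compatible; overlapping ones
    are compatible only if their target sides agree at every shared node."""
    return all(a[node] == b[node] for node in overlap(a, b))


def merge(matches: Iterable[Match]) -> Match:
    """Stitch compatible matches together, rejecting any conflict."""
    merged: Match = {}
    for match in matches:
        if not compatible(merged, match):
            raise ValueError("conflicting target sides")
        merged.update(match)
    return merged


first = {"cambiar": "change", "configuracion": "setting"}
second = {"configuracion": "setting", "seguridad": "security"}  # assumed
third = {"configuracion": "value"}

transfer = merge([first, second])
```

Under these assumptions the first two matches merge into a single structure, while adding the third raises an error because it translates the shared node differently.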

[0145] Table 1 shows examples of the information that can
be used to further define the subset of matching transfer
mappings (either to choose among conflicting matching
transfer mappings or to narrow the subset of compatible,
matching transfer mappings). Such information can include
how much of the input sentence is covered by the subset of
matching transfer mappings (collectively) and the size of the
mappings, which can be ascertained from the logical form
that is matched in the transfer mapping itself. The size of a
logical form includes both the number of specified nodes as
well as the number of linguistic relationships between the
nodes. Thus, by way of example, the size of the logical form
from the source side of mapping 350 equals 2, while the size
of the logical form on the target side equals 1. In another
example, the logical form on the source side of mapping 354
equals 4, while the target side of mapping 354 equals 2.
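
A minimal sketch of the size measure, counting specified nodes plus linguistic relationships, is given below for illustration. The function name and the list-based representation are hypothetical, and the shape assumed for the source side of mapping 354 ("hacer" with a direct object "clic" and a wild-card node attached by "en") is inferred from the surrounding description rather than taken from FIG. 8.

```python
from typing import Sequence

WILDCARD = "*"  # under-specified node: not counted as a specified node


def logical_form_size(nodes: Sequence[str], relations: Sequence[str]) -> int:
    """Number of specified (non-wild-card) nodes plus number of relations."""
    specified = [node for node in nodes if node != WILDCARD]
    return len(specified) + len(relations)


# Assumed shape of the source side of mapping 354.
source_354_size = logical_form_size(["hacer", "clic", WILDCARD], ["Dobj", "en"])

# A single specified node with no relationships.
single_node_size = logical_form_size(["click"], [])
```

Under the assumed shape, the sketch yields 4 for the source side of mapping 354, which agrees with the value stated above.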

[0146] The information for choosing the subset of transfer
mappings can also include other information related to
individual transfer mappings, such as the frequency with
which the logical forms in the transfer mappings are seen in
the training data. If desired, the training data can include
"trusted" training data, which can be considered more
reliable than other training data. The frequency of the mapping
as seen in the trusted training data can be retained in
addition, or in the alternative, to storing the frequency of the
mapping as seen in all of the training data.
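
As one hedged illustration of retaining both frequencies, the following sketch tallies how often each mapping is seen in all of the training data and, separately, in the trusted portion. The function name, the tuple layout of the observations and the sample data are assumptions made for the example.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_frequencies(
    observations: Iterable[Tuple[str, bool]]
) -> Tuple[Counter, Counter]:
    """Tally mappings over all training data and over trusted data only.

    Each observation is (mapping identifier, came from trusted data).
    """
    overall: Counter = Counter()
    trusted: Counter = Counter()
    for mapping_id, is_trusted in observations:
        overall[mapping_id] += 1
        if is_trusted:
            trusted[mapping_id] += 1
    return overall, trusted


# Invented observations for illustration.
observed = [
    ("click=>hacer clic", True),
    ("click=>hacer clic", False),
    ("click=>elegir", False),
]
overall, trusted = count_frequencies(observed)
```

Keeping the two tallies separate allows either count, or both, to be consulted when conflicting mappings are compared at run time.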

[0147] Other information that can be helpful in selecting
the subset of matching transfer mappings when matching
source logical forms to transfer mappings includes the extent
of complete alignment of the logical forms in the training
data from which the logical forms of a transfer mapping
have been obtained. In other words, the alignment procedure
can fully or completely align the nodes of the larger logical
forms, or some nodes can remain unaligned. In the example
of FIG. 5B, all the nodes were aligned; however, as
indicated above, this may not always be the case. Those
mappings associated with fully aligned logical forms may be
considered more reliable. Of course, information for
resolving conflicts or further defining the subset can also indicate
the frequency with which the mapping was generated from
both fully aligned logical forms as well as partially aligned
logical forms.

[0148] Likewise, additional information can include the
frequency with which the logical forms in the transfer
mapping originated from a complete parse of the
corresponding training data. In particular, the frequency with
which the mapping originated from a complete or fitted
parse, or in contrast, the frequency that the mapping
originated from only a partial parse can be stored for later use in
resolving conflicts while matching during translation.

[0149] Another form of information can include a score or
value assigned to the transfer mapping by the alignment
procedure used to extract the mapping. For instance, the
score can be a function of how "strong" (linguistically
meaningful) the aligned nodes are (or how confident the
alignment component is in the transfer mapping). The score
can therefore be a function of when (which iteration) and
which rule formed the alignment. The particular function or
metric used to calculate the alignment score is not crucial,
and any such metric can be used to generate information
related to an alignment score that can be used during run
time translation.

[0150] It should be noted that, although the present
invention is described above primarily with respect to analyzing,
aligning and using logical forms, at least some of the
inventive concepts discussed herein are applicable to other
dependency structures as well.






[0151] Although the present invention has been described
with reference to particular embodiments, workers skilled in
the art will recognize that changes may be made in form and
detail without departing from the spirit and scope of the
invention.

What is claimed is:

1. A computer implemented method of translating a
textual input in a first language to a textual output in a
second language, comprising:

generating an input logical form based on the textual
input;

selecting a set of one or more of a plurality of matching
transfer mappings in a transfer mapping database that
match at least a portion of the input logical form, based
on a predetermined metric;

combining the set of transfer mappings into a target
logical form; and

generating the textual output based on the target logical
form.

2. The method of claim 1 wherein the input logical form
includes a plurality of input nodes and wherein selecting
comprises:

selecting the set of transfer mappings based on a number
of input nodes covered by the set of transfer mappings,
collectively.

3. The method of claim 1 wherein selecting comprises:

selecting the set of transfer mappings based on sizes of the
plurality of matching transfer mappings.

4. The method of claim 3 wherein selecting comprises:

selecting the set of transfer mappings as a largest of the
plurality of matching transfer mappings.

5. The method of claim 1 wherein selecting comprises:

selecting the set of transfer mappings based on
frequencies with which the plurality of matching transfer
mappings were generated during a training phase used
in training the transfer mapping database.

6. The method of claim 1 wherein selecting comprises:

selecting the set of transfer mappings based on
frequencies with which the plurality of matching transfer
mappings were generated from completely aligned
logical forms during a training phase used in training
the transfer mapping database.

7. The method of claim 1 wherein selecting comprises:

selecting the set of transfer mappings based on
frequencies with which the plurality of matching transfer
mappings were generated from partially aligned logical
forms during a training phase used in training the
transfer mapping database.

8. The method of claim 1 wherein selecting comprises:

selecting the set of transfer mappings based on
frequencies with which the plurality of matching transfer
mappings were generated from non-fitted parses of
training data used to generate logical forms during a
training phase used in training the transfer mapping
database.

9. The method of claim 1 wherein selecting comprises:

selecting the set of transfer mappings based on a score
associated with each of the plurality of matching
transfer mappings, the score being indicative of a confidence
in the transfer mapping with which it is associated.



10. The method of claim 1 wherein combining the set of
transfer mappings comprises:

generating a linked logical form, indicative of links
between the input logical form and logical forms in the
transfer mapping database, based on the set of transfer
mappings.

11. The method of claim 10 wherein combining further
comprises:

generating a target logical form based on the linked
logical form.

12. The method of claim 11 wherein generating the target
logical form comprises:

accessing a bilingual dictionary based on words in the
linked logical form.

13. The method of claim 11 wherein generating the textual
output comprises:

generating the textual output based on the target logical
form.

14. The method of claim 1 wherein selecting comprises:

selecting as the set a plurality of overlapping, matching
transfer mappings.

15. The method of claim 14 wherein combining
comprises:

combining the plurality of overlapping, matching transfer
mappings to obtain the target logical form.

16. A machine translation system for translating a textual
input in a first language to a textual output in a second
language, the machine translation system comprising:

a matching component configured to match input logical
forms generated based on the textual input to a set of
one or more of a plurality of matching transfer
mappings in a transfer mapping database that match at least
a portion of the input logical form, based on a
predetermined metric; and

a generation component configured to generate the textual
output based on the selected transfer mappings.

17. The machine translation system of claim 16 and
further comprising:

a logical form generator receiving the textual input and
generating the input logical forms based on the textual
input.

18. A machine translation system for translating a textual
input in a first language to a textual output in a second
language, the machine translation system comprising:

an input generator generating an input dependency
structure based on the textual input;

a transfer mapping database including a plurality of
transfer mapping dependency structures formed based
on at least ten thousand parallel, aligned, training
sentences;

a matching component configured to receive the input
dependency structure and match it against a matching
set of one or more of the transfer mapping dependency
structures in the transfer mapping database; and

a generation component configured to generate the textual
output based on the matching transfer mapping
dependency structure.






19. The machine translation system of claim 18 wherein
the transfer mapping database includes a plurality of transfer
mapping dependency structures formed based on at least
fifty thousand parallel, aligned, training sentences.

20. The machine translation system of claim 19 wherein
the transfer mapping database includes a plurality of transfer
mapping dependency structures formed based on at least one
hundred thousand parallel, aligned, training sentences.

21. The machine translation system of claim 19 wherein
the transfer mapping database includes a plurality of transfer
mapping dependency structures formed based on at least one
hundred eighty thousand parallel, aligned, training
sentences.

22. The machine translation system of claim 19 wherein
the transfer mapping database includes a plurality of transfer
mapping dependency structures formed based on at least two
hundred thousand parallel, aligned, training sentences.

23. A computer implemented method of training a transfer
mapping database, comprising:

receiving a plurality of parallel, aligned, pairs of input
sentences in two different languages;

generating input logical forms for the input sentences in
both languages, the input logical forms being shared
across both languages; and

training the transfer mapping database based on the input
logical forms.

24. The method of claim 23 wherein training comprises:

aligning the input logical forms to obtain transfer
mappings; and

training the transfer mapping database based on the
transfer mappings.

25. The method of claim 24 wherein training the transfer
mapping database based on transfer mappings comprises:

training the transfer mapping database based only on
transfer mappings obtained from the aligned logical
forms at least a predetermined threshold number of
times.

26. The method of claim 25 wherein the predetermined
threshold number of times comprises two times.



27. The method of claim 23 wherein receiving comprises
receiving at least ten thousand parallel, aligned, training
sentences.

28. The method of claim 23 wherein receiving comprises
receiving at least fifty thousand parallel, aligned, training
sentences.

29. The method of claim 23 wherein receiving comprises
receiving at least one hundred thousand parallel, aligned,
training sentences.

30. The method of claim 23 wherein receiving comprises
receiving at least one hundred eighty thousand parallel,
aligned, training sentences.

31. The method of claim 23 wherein receiving comprises
receiving at least two hundred thousand parallel, aligned,
training sentences.

32. A computer implemented method of training a transfer
mapping database, comprising:

receiving a plurality of parallel, aligned, pairs of input
sentences in two different languages;

generating input logical forms for the input sentences in
both languages;

aligning the input logical forms to obtain transfer
mappings;

filtering the transfer mappings obtained; and

training the transfer mapping database based only on the
selected transfer mappings.

33. The method of claim 32 wherein filtering the transfer
mappings comprises:

filtering transfer mappings obtained from the aligned
logical forms less than at least a predetermined
threshold number of times.

34. The method of claim 33 wherein filtering the transfer
mappings comprises:

filtering transfer mappings obtained from the aligned
logical forms less than at least two times.

* * * * *



EXHIBIT B 

U.S. Patent No. 5,477,451 
filed July 5, 2001 
Brown et al. 



United States Patent [19]
Brown et al.

US005477451A
[11] Patent Number: 5,477,451
[45] Date of Patent: Dec. 19, 1995

[54] METHOD AND SYSTEM FOR NATURAL
LANGUAGE TRANSLATION

[75] Inventors: Peter F. Brown, New York; John
Cocke, Bedford; Stephen A. Della
Pietra, Pearl River; Vincent J. Della
Pietra, Blauvelt; Frederick Jelinek,
Briarcliff Manor; Jennifer C. Lai,
Garrison; Robert L. Mercer, Yorktown
Heights, all of N.Y.

[73] Assignee: International Business Machines
Corp., Yorktown Heights, N.Y.



[21] Appl. No.: 736,278

[22] Filed: Jul. 25, 1991

[51] Int. Cl.6: G06F 17/20; G06F 17/27
[52] U.S. Cl.: 364/419.08
[58] Field of Search: 364/419, 419.02,
364/419.08, 419.16; 381/43, 51

[56] References Cited

U.S. PATENT DOCUMENTS

4,754,489    6/1988   Bokser             382/40
4,829,580    5/1989   Church             381/52
4,852,173    7/1989   Bahl et al.        381/43
4,882,759   11/1989   Bahl et al.        381/51
4,984,178    1/1991   Hemphill et al.    364/419
4,991,094    2/1991   Fagen et al.       364/419
5,033,087    7/1991   Bahl et al.        381/43
5,068,789   11/1991   Van Vliembergen    364/419
5,109,509    4/1992   Katayama et al.    395/600
5,146,405    9/1992   Church             364/419
5,200,893    4/1993   Ozawa et al.       364/419

FOREIGN PATENT DOCUMENTS

0327266      8/1989   European Pat. Off.
0357344      3/1990   European Pat. Off.
0399533     11/1990   European Pat. Off.
WO9010911    9/1990   WO



OTHER PUBLICATIONS

"Method for Inferring Lexical Associations from Textual
Co-Occurrences", IBM Technical Disclosure Bulletin, vol.
33 1B, Jun. 1990.

"Tagging Text with a Probabilistic Model", by Bernard
Merialdo, Proceedings of the International Conference on
Acoustics, Speech and Signal Processing, Paris, France,
May 14-17, 1991.

(List continued on next page.) 

Primary Examiner: David M. Huntley
Assistant Examiner: Omits R. Kyle
Attorney, Agent, or Firm: Sterne, Kessler, Goldstein & Fox;
Robert P. Tassinari



[57] ABSTRACT



The present invention is a system for translating text from a 
first source language into a second target language. The 
system assigns probabilities or scores to various target- 
language translations and then displays or makes otherwise 
available the highest scoring translations. The source text is 
first transduced into one or more intermediate structural 
representations. From these intermediate source structures a 
set of intermediate target-structure hypotheses is generated. 
These hypotheses are scored by two different models: a 
language model which assigns a probability or score to an 
intermediate target structure, and a translation model which 
assigns a probability or score to the event that an interme- 
diate target structure is translated into an intermediate source 
structure. Scores from the translation model and language 
model are combined into a combined score for each inter- 
mediate target-structure hypothesis. Finally, a set of target- 
text hypotheses is produced by transducing the highest 
scoring target-structure hypotheses into portions of text in 
the target language. The system can either run in batch 
mode, in which case it translates source-language text into 
a target language without human assistance, or it can func- 
tion as an aid to a human translator. When functioning as an 
aid to a human translator, the human may simply select from 
the various translation hypotheses provided by the system, 
or he may optionally provide hints or constraints on how to 
perform one or more of the stages of source transduction, 
hypothesis generation and target transduction. 

42 Claims, 54 Drawing Sheets 







5,477,451 

Page 2 



OTHER PUBLICATIONS 

"Word-Sense Disambiguation Using Statistical Methods", 
by Peter F. Brown et al., appearing in the Proceedings of the 
29th Annual Meeting of the Association for Computational 
Linguistics, Jun. 1991, pp. 264-270. 
"Lex - A Lexical Analyzer Generator", M. E. Lesk, Computer
Science Technical Report, No. 39, Bell Laboratories,
Oct. 1975.

"Self Organized Language Modeling for Speech Recognition"
by F. Jelinek, Language Processing for Speech Recognition,
pp. 450-506.

"A Tree-Based Statistical Language Model for Natural
Language Speech Recognition" by Lalit R. Bahl et al., IEEE
Transactions on Acoustics, vol. 37, No. 7, Jul. 1989, pp.
1001-1008.

"Trainable Grammars for Speech Recognition" by James K.
Baker, Speech Communications Papers Presented at the
97th Meeting of the Acoustic Society of America, 1979, pp.
547-550.

"Interpolated Estimation of Markov Source Parameters from
Sparse Data" by F. Jelinek and R. L. Mercer, Workshop on
Pattern Recognition in Practice, Amsterdam (Netherlands):
North Holland, May 21-23, 1980.
"An Inequality and Associated Maximization Technique in 
Statistical Estimation for Probabilistic Functions of Markov 
Processes", by Leonard E. Baum, Inequalities, vol. 3, 1972, 
pp. 1-8. 

"Aligning Sentences in Parallel Corpora" by Peter F. Brown 
et al., appearing in the Proceedings of the 29th Annual 
Meeting of the Association for Computational Linguistics, 
Jun. 1991, pp. 169-176. 

"Partial Traceback in Continuous Speech Recognition" by 
James C. Spohrer et al., Proceedings of the IEEE Interna- 
tional Conference on Acoustics, Speech and Signal Process- 
ing, Paris, France, 1982. 

"Deriving Translation Data from Bilingual Texts", Catizone 
et al., Proceedings of the First International Acquisition 
Workshop, Detroit, Mich., 1989. 

"Making Connections", by M. Kay, appearing in ACH/ 
ALLC '91, Tempe, Ariz., 1991, p. 1. 



U.S. Patent    Dec. 19, 1995    Sheet 1 of 54    5,477,451

[Figure 1: block diagram. Legible label: "Computes Pr(F | E)". Reference numerals: 102, 104.]

[Figure 2 (Sheet 2 of 54): block diagram. French Transducer: transforms F into an intermediate structure F'. Decoder: given F', finds E' for which Pr(E') Pr(F' | E') is large. English Language Model: computes Pr(E'). English to French Translation Model: computes Pr(F' | E'). English Transducer: constructs E from an intermediate structure E'. Reference numerals: 201, 202, 203, 204, 205, 207.]

[Figure 3 (Sheet 3 of 54): block diagram. French text -> transducer from French text to French intermediate structures -> French intermediate structures -> statistical transfer from French intermediate structures to interlingua -> statistical transfer from interlingua to English intermediate structures -> English intermediate structures -> transducer from English intermediate structures to English text -> English text. Reference numerals: 301, 302, 308, 309.]



[Figure 4 (Sheet 4 of 54): flowchart. Capture next section of source text -> translate source text into target language -> display target text. Reference numerals: 401, 403, 404, 405.]

[Figures 5 and 6 (Sheets 5 and 6 of 54): no text recoverable.]

[Figure 7 (Sheet 7 of 54): flowchart. Source text -> transduce source text -> generate target structure hypotheses and scores -> select hypothesis with the greatest score -> transduce target structure -> target text. Hypothesis generation draws on a target-structure language model and a target-structure to source-structure translation model. Reference numerals: 701 through 708 (several only partly legible).]

[Figure 8: no text recoverable.]



[Figure 9 (Sheet 9 of 54): flowchart. Source text and constraints -> transduce source text -> generate target structure hypotheses and scores -> select set of hypotheses with greatest scores -> transduce target structures. Hypothesis generation draws on a target-structure language model and a target-structure to source-structure translation model. Legible reference numerals: 901, 902, 903, 904, 906, 705, 706.]

[Figure 10 (Sheet 10 of 54): hardware block diagram. Labeled components: translation system, data storage, operating system, RAM, CPU, input/output interface, scanner, printer, external network. Legible reference numerals: 1001, 1002, 1003, 1005, 1006, 1007, 1009, 1010, 1012, 1014.]



[Figure 11 (Sheet 11 of 54): flowchart. Source text -> tokenize raw text -> determine words from tokens -> annotate words with parts-of-speech -> perform syntactic transformations -> perform morphological transformations -> annotate morphs with sense labels. Reference numerals: 1101 through 1106, 701.]

[Figure 12 (Sheet 12 of 54): flowchart. Token sequence -> determine true-case labels -> perform specialized transformations -> cased word sequence. Reference numerals: 1201, 1202, 1102.]

[Figure 13 (Sheet 13 of 54): flowchart. Word sequence annotated with part-of-speech tags -> perform question inversion -> perform do-not coalescence -> perform adverb movement -> transformed word sequence annotated with part-of-speech tags. Reference numerals: 1301, 1302, 1303, 1104.]

[Figure 14 (Sheet 14 of 54): flowchart. Word sequence annotated with part-of-speech tags -> perform question inversion -> perform negative coalescence -> perform pronoun movement -> perform adverb and adjective movement -> transformed word sequence annotated with part-of-speech tags. Reference numerals: 1401 through 1404, 1104.]

[Figure 15 (Sheet 15 of 54): no text recoverable.]

[Figure 16 (Sheet 16 of 54): block diagram. Legible labels: table of patterns, table of actions. Reference numerals as scanned: 1605, 1602, 1503.]



U.S. Patent Dec. 19, 1995 Sheet 17 of 54 5,477,451 



begin patterns and actions;

    WH_NP? . [1] . DO_SET . 'not'? . [2] . SUBJECT_NP . [3] . ADVERB* . [4] .
        ,BARE_INF_TAG . (#)* . '?'
        -> english_question_inversion;

    (#)+
        -> default_action;

end patterns and actions;

begin auxiliary patterns;

    ADVERB = ,ADV_TAG | 'most' , 'RR!' | 'more' , 'RR!' ;

    BARE_NP = ((ADVERB)* . ,ADJ_TAG)* . ,COMMON_NOUN_TAG ;

    SUBJECT_NP = (,SUBJECT_TAG) | (,DETERMINER_TAG)? . BARE_NP ;

    WH_NP = WH_WORD | (('which' | 'what' | 'how' . 'many') . BARE_NP) ;

end auxiliary patterns;

begin sets of words;

    DO_SET = {do does did};

    WH_WORD = {what who where why whom how when which};

end sets of words;

begin sets of tags;

    PROPER_NOUN_TAG = {NP NP1 NP2};

    PRONOUN_TAG = {PN PN1 PNQS};

    SUBJECT_TAG = {PROPER_NOUN_TAG PRONOUN_TAG};

    BARE_INF_TAG = {VV0 VH0 VB0 VD0};

    COMMON_NOUN_TAG = {ND1 NN NN1 NN2 NNJ NNJ1 NNJ2 NNL NNL1 NNL2
                       NO NN02 NNS1 NNS2 NNSA1 NNSB NNSB1
                       NNSB2 NNT1 NNT2 NNU NNU1 NNU2 NPD1 NPD2 NPM1};

    ADV_TAG = {RA REX RG RGA RGR RGT RP RPK RR RRR RRT RT};

    ADJ_TAG = {JA JB JBR JBT JJ JJR JJT JK};

    DETERMINER_TAG = {AT ATI DA DAI DA2 DA2R DA2T DAR DAT DB DB2 DD
                      DDI DD2 DDQ DDQ$ DDQV};

end sets of tags;



Figure 17 



U.S. Patent Dec. 19, 1995 Sheet 18 of 54 5,477,451 



 1  english_question_inversion: procedure;
 2  /* This procedure is invoked when the pattern:
 3     WH_NP? . [1] . DO_SET . 'not'? . [2] . SUBJECT_NP . [3] . ADVERB* . [4] .
 4     BARE_INF_TAG . (#)* . '?' is matched */
 5
 6  set the output sequence to null;
 7
 8  append tuples from the beginning of the input sequence
 9     to position @reg(1)-1 to output;
10
11  append tuples from input positions
12     @reg(2) to @reg(3)-1 to output;
13
14  if @reg(2) = @reg(1) + 2 then do;
15     append tuples from input positions @reg(1) to
16        @reg(1)+1 to output;
17     append tuples from input in positions
18        @reg(3) to @reg(4) to output;
19  end;
20
21  else do;
22     append tuples from input in positions
23        @reg(3) to @reg(4)-1 to output;
24
25     if the word at position @reg(1) is 'does' then do;
26        append the input word at position @reg(4) and the
27           tag for 'does' to output;
28     end;
29
30     else if the word at position @reg(1) is 'do' then do;
31        append tuple at input position @reg(4)
32     end;
33
34     else if the word at position @reg(1) is 'did' then do;
35        append the input word at position @reg(4)
36           and the tag for 'did' to output;
37     end;
38
39     else there is an error;
40
41  end;
42
43  append the tuples from input positions @reg(4)+1 to
44     one less than the last position to output;
45
46  append the word 'QINV' and the tag 'QINV' to output;
47
48  end english_question_inversion;



Figure 18 
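
The procedure of Figure 18 can be sketched in Python as follows. This is only an illustrative reading of the pseudocode, not the patented implementation: the function signature, the representation of the input as (word, tag) tuples, the dictionary of 1-based register positions, and the tag names in the example are all assumptions made for the sketch.

```python
def english_question_inversion(tuples, reg):
    """Rearrange an inverted English question, following Figure 18.

    tuples: list of (word, tag) pairs; the last one is the '?' token.
    reg:    dict mapping register numbers 1..4 to 1-based input positions
            (assumed to have been set by the pattern matcher).
    """
    def span(first, last):
        # Tuples at 1-based positions first..last inclusive (empty if last < first).
        return tuples[first - 1:last]

    r1, r2, r3, r4 = reg[1], reg[2], reg[3], reg[4]
    out = []
    out += span(1, r1 - 1)            # optional WH noun phrase
    out += span(r2, r3 - 1)           # subject noun phrase

    if r2 == r1 + 2:                  # 'not' follows the do-word: keep both
        out += span(r1, r1 + 1)
        out += span(r3, r4)           # adverbs and the bare infinitive
    else:
        out += span(r3, r4 - 1)       # adverbs only
        do_word, do_tag = tuples[r1 - 1]
        verb, verb_tag = tuples[r4 - 1]
        if do_word in ("does", "did"):
            out.append((verb, do_tag))     # verb inherits the tag of the do-word
        elif do_word == "do":
            out.append((verb, verb_tag))   # tuple copied unchanged
        else:
            raise ValueError("unexpected do-word: %r" % do_word)

    out += span(r4 + 1, len(tuples) - 1)   # rest of the sentence, without '?'
    out.append(("QINV", "QINV"))           # marker that inversion was removed
    return out


# Hypothetical tagged input for "does he like it ?"
example = [("does", "VDZ"), ("he", "PN1"), ("like", "VV0"),
           ("it", "PN1"), ("?", "?")]
result = english_question_inversion(example, {1: 1, 2: 2, 3: 3, 4: 3})
```

Passing the register positions explicitly keeps the sketch independent of how the pattern of Figure 17 is matched, which the figures do not specify in executable form.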



[Figure 19 (Sheet 19 of 54): flowchart. Generate nouns, adverbs and adjectives from morphological units -> provide do-support for negatives -> provide do-support for inverted questions -> separate combined verb tenses -> conjugate verbs -> reposition adjectives and adverbs -> restore question inversion -> resolve target-word ambiguities resulting from ambiguities in the morphology -> case target sentence. Reference numerals: 1901 through 1910, 704.]

[Figure 20 (Sheet 20 of 54): block diagram. Legible label: target structure language model. Reference numeral: 705.]

[Figure 21: block diagram. Legible labels: detailed translation model (2101); target structure to source structure translation model (706).]

[Figure 22: alignment example between the intermediate structures "John V_past_3s kissed Mary" (as scanned) and "Jean V_past_3s embrasser Marie".]

[Sheet 21 of 54: no text recoverable.]




[Figure 24 (Sheet 22 of 54): flowchart. Generate allowed alignments between the source structure and the target structure -> compute the score of each alignment -> sum the scores of all alignments. Legible label: detailed translation model. Reference numeral: 2101.]
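
The three steps of Figure 24 (generate alignments, score each one, sum the scores) can be illustrated with a deliberately small sketch. It assumes, purely for illustration, that every source position may align to any target position and that an alignment's score is the product of per-word lexical probabilities; the table `t` and its values are hypothetical, and the patent's detailed translation model contains further submodels that are not represented here.

```python
from itertools import product


def alignment_score(source, target, alignment, t):
    """Score one alignment as a product of lexical probabilities.

    alignment[j] is the index of the target unit that source unit j connects to.
    t maps (source_unit, target_unit) to a probability (hypothetical table).
    """
    score = 1.0
    for j, i in enumerate(alignment):
        score *= t.get((source[j], target[i]), 0.0)
    return score


def translation_score(source, target, t):
    """Sum the scores of all allowed alignments (here: all of them)."""
    positions = range(len(target))
    return sum(
        alignment_score(source, target, alignment, t)
        for alignment in product(positions, repeat=len(source))
    )


# Toy example using the units of Figure 22; the numbers are made up.
t = {
    ("Jean", "John"): 0.9, ("Jean", "kissed"): 0.1,
    ("embrasser", "John"): 0.2, ("embrasser", "kissed"): 0.8,
}
total = translation_score(["Jean", "embrasser"], ["John", "kissed"], t)
```

Enumerating every alignment is exponential in the length of the source structure, so this form is only practical for toy inputs.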



[Figure 25 (Sheet 23 of 54): legible label: table of parameter values.]

[Figure 26 (Sheet 24 of 54): block diagram. Fertility submodel, lexical submodel and distortion submodel feed the probability of source structure and alignment given target structure. Supporting tables: table of fertility parameters, table of lexical parameters, table of distortion probabilities. Reference numerals: 2501, 2601, 2602, 2603.]

[Figure 27 (Sheet 25 of 54): flowchart. Set i to 0 -> select parameter values for model 0 -> increment i by 1 -> using parameter values of model i-1, select initial parameter values for model i (scanned as "model 2") -> starting from initial parameter values for model i, find locally optimal parameter values for model i. Legible reference numerals: 2702, 2703, 2704, 2705, 2706.]

[Figure 28 (Sheet 26 of 54): no text recoverable.]

[Figure 29 (Sheet 27 of 54): flowchart. Initial parameter values θ for model P -> choose θ to maximize R(P_θ, P…) (expression truncated in the scan) -> parameter values θ for model P. Reference numerals: 2804, 2901, 2902.]

[Figure 30 (Sheet 28 of 54): tree of word classes. Legible groups: (plan, letter, request, memo, case, question, charge, statement, draft); (evaluation, assessment, analysis, understanding, opinion, conversation, discussion); (reps, representatives, representative, rep); (accounts, people, customers, individuals, employees, students).]

[Figure 31 (Sheet 29 of 54): no text recoverable. Reference numeral: 3104.]



[Figure 32 (Sheet 30 of 54): flowchart. Assign each word to its own class -> select the pair of classes for which a merge results in the smallest loss of mutual information -> merge the selected pair of classes -> decision (yes/no; condition not legible). Other legible label: model for bigrams. Reference numerals: 3103, 3201, 3202, 3203, 3204.]
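
The merge loop of Figure 32 can be sketched as below. This is a naive illustration under stated assumptions, not the patent's procedure: bigram counts are supplied as a dictionary, the stopping rule is a target number of classes (the decision box in the figure is not legible), and the "smallest loss of mutual information" is found by recomputing the average mutual information after every trial merge.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(bigrams, assign):
    """Average mutual information (in bits) between adjacent classes."""
    joint = Counter()
    for (w1, w2), n in bigrams.items():
        joint[(assign[w1], assign[w2])] += n
    total = sum(joint.values())
    left, right = Counter(), Counter()
    for (c1, c2), n in joint.items():
        left[c1] += n
        right[c2] += n
    return sum(
        (n / total) * math.log2(n * total / (left[c1] * right[c2]))
        for (c1, c2), n in joint.items()
    )


def cluster(bigrams, num_classes):
    """Greedily merge classes until only num_classes remain."""
    words = sorted({w for pair in bigrams for w in pair})
    assign = {w: w for w in words}          # each word starts in its own class
    while len(set(assign.values())) > num_classes:
        best = None
        for a, b in combinations(sorted(set(assign.values())), 2):
            # Trial merge of class b into class a.
            trial = {w: (a if c == b else c) for w, c in assign.items()}
            mi = mutual_information(bigrams, trial)
            if best is None or mi > best[0]:    # highest MI = smallest loss
                best = (mi, trial)
        assign = best[1]
    return assign


# Toy bigram counts (hypothetical): determiners and nouns alternate.
counts = {
    ("the", "dog"): 1, ("the", "cat"): 1, ("a", "dog"): 1, ("a", "cat"): 1,
    ("dog", "the"): 1, ("dog", "a"): 1, ("cat", "the"): 1, ("cat", "a"): 1,
}
classes = cluster(counts, num_classes=2)
```

Recomputing the mutual information from scratch for every candidate pair is far too slow for a real vocabulary; the sketch only shows what quantity the selection step compares.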



[Figure 33 (Sheet 31 of 54): word alignment between "And(1) the(2) program(3) has(4) been(5) implemented(6)" and "Le(1) programme(2) a(3) ete(4) mis(5) en(6) application(7)".]

[Figure 34: word alignment between "The(1) balance(2) was(3) the(4) territory(5) of(6) the(7) aboriginal(8) people(9)" and "Le(1) reste(2) appartenait(3) aux(4) autochtones(5)".]

[Figure 35: word alignment between "The(1) poor(2) don't(3) have(4) any(5) money(6)" and "Les(1) pauvres(2) sont(3) demunis(4)".]

[Figure 36 (Sheet 32 of 54): alignment diagrams between English "cheap" and French "bon marche".]

[Figure 37: word alignment between the English sentence "What is the anticipated cost of administering and collecting fees under the new proposal ?" and a French sentence whose legible tokens are "En vertu de les nouvelles propositions , quel est le cout prevu de administration ... de perception ... les droits". Several tokens and the alignment links are not legible in the scan.]

[Figure 38 (Sheet 33 of 54): word alignment between an English sentence beginning "The secretary of state for external affairs comes ..." and its French translation. The remaining tokens and the alignment links are not reliably legible.]

[Figure 39 (Sheet 34 of 54): word alignment between "Mr.(1) Speaker(2) ,(3) if(4) we(5) might(6) return(7) to(8) starred(9) questions(10) ,(11) I(12) now(13) have(14) the(15) answer(16) and(17) will(18) give(19) it(20) to(21) the(22) house(23)" and its French translation. The French tokens and the alignment links are not reliably legible.]



[Figure 40 (Sheet 35 of 54): sense-labeling example. First sequence (4001): "Je vais prendre ma propre decision" / "I will make my own decision". Second sequence (4002): "Je vais prendre ma propre voiture" / "I will take my own car". Vocabulary word: prendre. Informant site: first noun to the right. Informant word: decision (first sequence), voiture (second sequence).]

[Figure 41 (Sheet 36 of 54): flowchart. Input sequence -> capture next word -> get best informant site for word -> capture informant word at site -> obtain class of informant -> label word with class of informant -> input sequence annotated with sense labels. Supporting table: table of informants and questions for every vocabulary word. Reference numerals: 4101 through 4109.]

[Figure 42 (Sheet 37 of 54): flowchart. Next word -> find the best question for each informant site -> store the informant site and the best question -> decision (yes/no). Supporting tables: table of possible informants; table of probabilities derived from Viterbi alignments; table of informants and questions for every vocabulary word. Reference numerals: 4201 through 4207.]

[Figure 43 (Sheet 38 of 54): flowchart. Next informant site -> find a question about the informant that provides a lot of information about translations of the vocabulary word into the target language -> store vocabulary word, informant site and question -> decision (yes/no). Supporting tables: table of possible informants; table of probabilities derived from Viterbi alignments; table of informants and questions for every vocabulary word. Reference numerals: 4301 through 4304, 4205, 4206, 4207.]

[Figure 44 (Sheet 39 of 54): flowchart. Initial choice of question ĉ -> for fixed ĉ, find the best q: q(e | c) = (1/norm) Σ_{x: ĉ(x) = c} p(e, x | f) -> for fixed q, find the best question ĉ: ĉ(x) = argmin_c D( p(E | x, f) ; q(E | c) ) -> decision (yes/no). Supporting table: table of alignment probabilities. Reference numerals: 4401 through 4405, 4205.]

[Figure 45 (Sheet 40 of 54): legible label: aligned words. Reference numerals: 4503, 4505.]



[Figure 46 (Sheet 41 of 54): flowchart with two parallel branches, one per text. Tokenize -> determine sentences (in each branch) -> align sentences that are mutual translations. Reference numerals: 4601, 4602, 4603, 4604, 4605.]

[Figure 47: flowchart. Find major and minor anchor points -> align major anchor points -> discard some aligned sections -> align sentences within subsections -> aligned sentences. Reference numerals: 4701, 4702, 4703, 4704, 4605.]



[Sheet 43 of 54: raw source text (first listing) and the same text after preprocessing (second listing).]

/*START_COMMENT* Beginning file = 048
101 H002-108 script A *END_COMMENT*/
.TB 029 060 090 099
.PL 060
.LL 120
.NF
The House met at 2 p.m.
.SP
*boMr. Donald MacInnis (Cape Breton
-East Richmond):*ro Mr. Speaker,
I rise on a question of privilege af-
fecting the rights and prerogatives
of parliamentary committees and one
which reflects on the word of two
ministers.
.SP
*boMr. Speaker:*roThe hon. member's
motion is proposed to the
House under the terms of Standing
Order 43. Is there unanimous consent?
.SP
*boSome hon. Members:*roAgreed.
(*itText*ro)
Question No. 17 — *boMr. Mazankowski:
*ro
1. For the period April 1, 1973 to
January 31, 1974, what amount of
money was expended on the operation
and maintenance of the Prime
Minister's residence at Harrington
Lake, Quebec?
.SP
(1415)
(*itLater:*ro)
.SP
*boMr. Cossitt:*ro Mr. Speaker, I rise
on a point of order to ask for
clarification by the parliamentary
secretary.


 1. \SCM{} Document = 048 101 H002-108
    script A \ECM{}
 2. The House met at 2 p.m.
 3. \SCM{} Paragraph \ECM{}
 4. \SCM{} Author = Mr. Donald MacInnis
    (Cape Breton-East Richmond) \ECM{}
 5. Mr. Speaker, I rise on a question of
    privilege affecting the rights and
    prerogatives of parliamentary
    committees and one which reflects on
    the word of two ministers.
21. \SCM{} Paragraph \ECM{}
22. \SCM{} Author = Mr. Speaker \ECM{}
23. The hon. member's motion is proposed
    to the House under the terms of
    Standing Order 43.
44. Is there unanimous consent?
45. \SCM{} Paragraph \ECM{}
46. \SCM{} Author = Some hon. Members
    \ECM{}
47. Agreed.
61. \SCM{} Source = Text \ECM{}
62. \SCM{} Question = 17 \ECM{}
63. \SCM{} Author = Mr. Mazankowski
    \ECM{}
64. 1.
65. For the period April 1, 1973 to
    January 31, 1974, what amount of
    money was expended on the operation
    and maintenance of the Prime
    Minister's residence at Harrington
    Lake, Quebec?
66. \SCM{} Paragraph \ECM{}
81. \SCM{} Time = (1415) \ECM{}
82. \SCM{} Time = Later \ECM{}
83. \SCM{} Paragraph \ECM{}
84. \SCM{} Author = Mr. Cossitt \ECM{}
85. Mr. Speaker, I rise on a point of
    order to ask for clarification by
    the parliamentary secretary.



Figure 48 



[Figure 49 (Sheet 44 of 54): no text recoverable.]

[Figure 50: state diagram. Legible labels: BEAD, STOP.]

[Figure 51: table of bead types.]

Bead       Text
e          one English sentence
f          one French sentence
ef         one English and one French sentence
eef        two English and one French sentence
eff        one English and two French sentences
¶e         one English paragraph
¶f         one French paragraph
¶e ¶f      one English and one French paragraph

[Figures 52 and 53 (Sheets 45 and 46 of 54): plots. Legible axis label: sentence length.]

[Figure 54 (Sheet 47 of 54): flowchart. Initialize set of partial hypotheses and scores -> select partial hypotheses to extend to new hypotheses -> decision (no: extend selected partial hypotheses and repeat; otherwise done). Extension draws on the target-structure language model (705) and the target-structure to source-structure translation model (706). Reference numerals: 702, 5401, 5402, 5403, 5404.]



U.S. Patent Dec. 19, 1995 Sheet 48 of 54 5,477,451 

La jeune fille a reveille sa mere 



transduce 
source text 



la jeune fille V_past_3s reveiller sa mere 



Figure 55 



the 



la jeune fille V_past_3s reveiller sa mere 



Figure 56 



mother 



la jeune fille V_past_3s reveiller sa mere 



Figure 57 



U.S. Patent Dec 19, 1995 Sheet 49 of 54 5,477,451 

the girl + 



la jeune fille V_past_3s reveiller sa mere 



Figure 58 



the girl 



la jeune fille V_past_3s reveiller sa mere 



Figure 59 



the girl + 



la jeune fille V_past_3s reveiller sa mere 



Figure 60 



the girl V_past_3s to_wake 



la jeune fille V_past_3s reveiller sa mere 



Figure 61 



U.S. Patent 



Dec 19, 1995 



Sheet 50 of 54 



5,477,451 



the girl V_past_3s to_wake up her 




la jeune fille V_past_3s reveiller sa mere 
Figure 62 



the girl V_past_3s to_wake up her + 



la jeune fille V_past_3s reveiller sa mere 



Figure 63 




[Lattice of subsets; legible nodes: {1,2}, {1,3}, {1,2,3}, {2,3}.]



Figure 64 



[Figure 65 (Sheet 51 of 54): no text recoverable.]

[Figure 66 (Sheet 52 of 54): flowchart. Initialize threshold of every active priority queue to infinity -> set thresholds for every active priority queue on the frontier -> set i = maximum cardinality of any priority queue on the frontier -> decision "i = 0?" -> set j = 1 -> set k = number of priority queues of cardinality i -> process j-th priority queue of cardinality i -> an increment step (garbled in the scan) -> i = i-1. Reference numerals: 5402, 6601 through 6611.]

[Figure 67 (Sheet 53 of 54): flowchart. Priority queue Q -> let t = threshold associated with Q -> let i = partial hypothesis in Q with the greatest score -> decision (yes/no; condition not legible) -> add i to list to be extended -> use i to adjust thresholds in parent priority queues -> i = partial hypothesis with next largest score in Q -> done. Reference numerals: 6608, 6701 through 6707.]



U.S. Patent Dec. 19, 1995 Sheet 54 of 54 5,477,451 



 1  procedure: extend_partial_hypotheses_on_list;
 2  do for i = 1 to the number of partial hypotheses on list;
 3     let h be the i-th partial hypothesis to be extended;
 4     do for p = 1 to the number of positions in the source structure;
 5        if p is not already aligned in h then
 6           extend_h_by_accounting_for_source_morpheme_in_position_p;
 7     end of do for p;
 8  end of do for i;
 9  end of procedure extend_partial_hypotheses_on_list;

10  procedure: extend_h_by_accounting_for_source_morpheme_in_position_p;
11  if h is an open partial hypothesis then do;
12     let q be the position of the open target morpheme in h;
13     extend_open_h_by_connecting_p_to_q_and_keep_h_open;
14     extend_open_h_by_connecting_p_to_q_and_close_h;
15  end of if h;
16  if h is not an open partial hypothesis then do;
17     let s be the source morpheme in position p;
18     do j = 1 to the number of target morpheme translations of s;
19        let t be the j-th target morpheme translation of s;
20        create_open_extension_of_h_by_connecting_p_to_t;
21        create_closed_extension_of_h_by_connecting_p_to_t;
22        create_extension_of_h_by_connecting_p_to_null_target_morpheme;
23        create_list_of_target_morphemes_to_be_inserted_before_t;
24        do k = 1 to number of target morphemes on list to be inserted;
25           let tl be the k-th target morpheme to be inserted;
26           create_open_extension_of_h_by_connecting_p_to_tl_t;
27           create_closed_extension_of_h_by_connecting_p_to_tl_t;
28        end of do k;
29     end of do j;
30  end of if h;
31  end of procedure extend_h_by_accounting_for_source_morpheme_in_position_p;



Figure 68 






METHOD AND SYSTEM FOR NATURAL 
LANGUAGE TRANSLATION 



TECHNICAL FIELD 

The present invention relates to machine translation in 
general, and, in particular to methods of performing machine 
translation which use statistical models of the translation 
process. 

BACKGROUND ART 






There has long been a desire to have machines capable of 
translating text from one language into text in another 
language. Such machines would make it much easier for 
humans who speak different languages to communicate with 
one another. The present invention uses statistical tech- 
niques to attack the problem of machine translation. Related 
statistical techniques have long been used in the field of 
automatic speech recognition. The background for the 
invention is thus in two different fields: automatic speech 
recognition and machine translation. 

The central problem in speech recognition is to recover 
from an acoustic signal the word sequence which gave rise 
to it. Prior to 1970, most speech recognition systems were 
built around a set of hand-written rules of syntax, semantics 
and acoustic-phonetics. To construct such a system it is 
necessary, first, to discover a set of linguistic rules that can 
account for the vast complexity of language and, second, 
to construct a coherent framework in which these rules can 
be assembled to recognize speech. Both of these problems 
proved insurmountable. It proved too difficult to write down 
by hand a set of rules that adequately covered the vast scope 
of natural language and to construct by hand a set of weights, 
priorities and if-then statements that can regulate interac- 
tions among its many facets. 

This impasse was overcome in the early 1970's with the 
introduction of statistical techniques to speech recognition. 
In the statistical approach, linguistic rules are extracted 
automatically using statistical techniques from large 
databases of speech and text. Different types of linguistic 
information are combined via the formal laws of probability 
theory. Today, almost all speech recognition systems are 
based on statistical techniques. 

Speech recognition has benefited by using statistical lan- 
guage models which exploit the fact that not all word 
sequences occur naturally with equal probability. One 
simple model is the trigram model of English, in which it is 
assumed that the probability that a word will be spoken 
depends only on the previous two words that have been 
spoken. Although trigram models are simple-minded, they 
have proven extremely powerful in their ability to predict 
words as they occur in natural language, and in their ability 
to improve the performance of natural-language speech 
recognition. In recent years more sophisticated language 
models based on probabilistic decision-trees, stochastic con- 
text-free grammars and automatically discovered classes of 
words have also been used. 
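
As a minimal illustration of the trigram assumption described above, the following sketch estimates the probability of a word given the two preceding words by relative frequency. The function names and the toy corpus are assumptions made for the example; a practical model would also smooth these estimates, which is not shown.

```python
from collections import Counter


def train_trigram(tokens):
    """Count trigrams and their two-word histories in a token list."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    histories = Counter((u, v) for (u, v, _w) in trigrams.elements())
    return trigrams, histories


def trigram_probability(w, u, v, trigrams, histories):
    """Relative-frequency estimate of Pr(w | u, v); 0.0 for unseen histories."""
    if histories[(u, v)] == 0:
        return 0.0
    return trigrams[(u, v, w)] / histories[(u, v)]


# Toy corpus (hypothetical).
corpus = "the cat sat on the mat the cat ran".split()
trigrams, histories = train_trigram(corpus)
p_sat = trigram_probability("sat", "the", "cat", trigrams, histories)
```

In the toy corpus the history "the cat" is followed once by "sat" and once by "ran", so each continuation receives half of the probability mass.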

In the early days of speech recognition, acoustic models 
were created by linguistic experts, who expressed their 
knowledge of acoustic-phonetic rules in programs which 
analyzed an input speech signal and produced as output a 
sequence of phonemes. It was thought to be a simple matter 
to decode a word sequence from a sequence of phonemes. It 
turns out, however, to be a very difficult job to determine an 
accurate phoneme sequence from a speech signal. Although 
human experts certainly do exist, it has proven extremely 
difficult to formalize their knowledge. In the alternative 
statistical approach, statistical models, most predominantly 
hidden Markov models, capable of learning acoustic-phonetic 
knowledge from samples of speech are employed. 

The present approaches to machine translation are similar 
in their reliance on hand-written rules to the approaches to 
speech recognition twenty years ago. Roughly speaking, the 
present approaches to machine translation can be classified 
into one of three categories: direct, interlingual, and transfer. 
In the direct approach a series of deterministic linguistic 
transformations is performed. These transformations 
directly convert a passage of source text into a passage of 
target text. In the transfer approach, translation is performed 
in three stages: analysis, transfer, and synthesis. The analysis 
stage produces a structural representation which captures 
various relationships between syntactic and semantic 
aspects of the source text. In the transfer stage, this structural 
representation of the source text is then transferred by a 
series of deterministic rules into a structural representation 
of the target text. Finally, in the synthesis stage, target text 
is synthesized from the structural representation of the target 
text. The interlingual approach to translation is similar to the 
transfer approach except that in the interlingual approach an 
internal structural representation that is language independent 
is used. Translation in the interlingual approach takes 
place in two stages: the first analyzes the source text into this 
language-independent interlingual representation, and the 
second synthesizes the target text from the interlingual 
representation. All these approaches use hand-written 
deterministic rules. 

Statistical techniques in speech recognition provide two 
advantages over the rule-based approach. First, they provide 
means for automatically extracting information from large 
bodies of acoustic and textual data, and second, they 
provide, via the formal rules of probability theory, a sys- 
tematic way of combining information acquired from dif- 
ferent sources. The problem of machine translation between 
natural languages is an entirely different problem than that 
of speech recognition. In particular, the main area of 
research in speech recognition, acoustic modeling, has no 
place in machine translation. Machine translation does face 
the difficult problem of coping with the complexities of 
natural language. It is natural to wonder whether this prob- 
lem won't also yield to an attack by statistical methods, 
much as the problem of coping with the complexities of 
natural speech has been yielding to such an attack. Although 
the statistical models needed would be of a very different 
nature, the principles of acquiring rules automatically and 
combining them in a mathematically principled fashion 
might apply as well to machine translation as they have to 
speech recognition. 

DISCLOSURE OF THE INVENTION 

The present invention is directed to a system and method 
for translating source text from a first language to target text 
in a second language different from the first language. The 
source text is first received and stored in a first memory 
buffer. Zero or more user defined criteria pertaining to the 
source text are also received and stored by the system. The 
user defined criteria are used by various subsystems to 
bound the target text. 

The source text is then transduced into one or more 
intermediate source structures which may be constrained by 
any of the user defined criteria. One or more target 
hypotheses are then generated. Each of the target hypotheses 
comprises an intermediate target structure of text selected from 
the second language. The intermediate target structures may 
also be constrained by any of the user defined criteria. 

A target-structure language model is used to estimate a 
first score which is proportional to the probability of 
occurrence of each intermediate target-structure of text associated 
with the target hypotheses. A target-structure-to-source-structure 
translation model is used to estimate a second score 
which is proportional to the probability that the intermediate 
target-structure of text associated with the target hypotheses 
will translate into the intermediate source-structure of text. 
For each target hypothesis, the first and second scores are 
combined to produce a target hypothesis match score. 

The intermediate target-structures of text are then transduced 
into one or more transformed target hypotheses of text 
in the second language. This translation step may also be 
constrained by any of the user defined criteria. 

Finally, one or more of the transformed target hypotheses 
is displayed to the user according to its associated match 
score and the user defined criteria. Alternatively, one or more 
of the transformed target hypotheses may be stored in 
memory or otherwise made available for future access. 

The intermediate source and target structures are trans- 
duced by arranging and rearranging, respectively, elements 
of the source and target text according to one or more of a 
lexical substitution, part-of-speech assignment, a morpho- 
logical analysis, and a syntactic analysis. 

The intermediate target structures may be expressed as an 
ordered sequence of morphological units, and the first score 
probability may be obtained by multiplying the conditional 
probabilities of each morphological unit within an interme- 
diate target structure given the occurrence of previous mor- 
phological units within the intermediate target structure. In 
another embodiment, the conditional probability of each unit 
of each intermediate target structure may depend only 
on a fixed number of preceding units within the intermediate 
target structure. 
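
The first-score computation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `sequence_log_prob`, the `history_size` parameter, and the toy probability table are all assumptions introduced here. Log probabilities are summed rather than multiplying raw probabilities, which is numerically equivalent to the product described in the text.

```python
import math

def sequence_log_prob(units, cond_prob, history_size=2):
    """Sum log P(unit | up to `history_size` preceding units) over a sequence."""
    total = 0.0
    for i, unit in enumerate(units):
        # Only a fixed number of preceding units is used as the history.
        history = tuple(units[max(0, i - history_size):i])
        total += math.log(cond_prob(unit, history))
    return total

# Hypothetical toy conditional model: a lookup table with a small floor value.
TOY_TABLE = {
    ("the", ()): 0.5,
    ("dog", ("the",)): 0.2,
    ("bark", ("dog",)): 0.1,
}

def toy_cond_prob(unit, history):
    return TOY_TABLE.get((unit, history), 1e-6)

score = sequence_log_prob(["the", "dog", "bark"], toy_cond_prob, history_size=1)
```

With `history_size=1` each unit depends only on its immediate predecessor; a larger value gives the longer fixed history mentioned in the second embodiment.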

The intermediate source structure transducing step may 
further comprise the step of tokenizing the source text by 
identifying individual words and word separators and 
arranging the words and the word separators in a sequence. 
Moreover, the intermediate source structure transducing step 
may further comprise the step of determining the case of the 
words using a case transducer, wherein the case transducer 
assigns each word a token and a case pattern. The case 
pattern specifies the case of each letter of the word. Each 
case pattern is evaluated to determine the true case pattern 
of the word. 
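
A rough sketch of the tokenizing and case-pattern steps is given below. The regular expression, the function names, and the U/L/- pattern alphabet are assumptions made for illustration; the patent does not specify them. The later step of evaluating patterns to choose the true case is not shown.

```python
import re

def tokenize(text):
    """Split text into words and word separators, preserving their order."""
    return re.findall(r"\w+|\W+", text)

def case_pattern(word):
    """Return a lower-cased token and a per-letter case pattern.

    'U' marks an upper-case letter, 'L' a lower-case letter, and '-'
    any character that has no case (digits, for example).
    """
    pattern = "".join(
        "U" if ch.isupper() else "L" if ch.islower() else "-" for ch in word
    )
    return word.lower(), pattern

tokens = tokenize("Hello, world!")
```

Here the token carries the identity of the word and the pattern carries the case of each letter, so the two can be processed separately downstream.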

BRIEF DESCRIPTION OF DRAWINGS 

The above stated and other aspects of the present invention 
will become more evident upon reading the following 
description of preferred embodiments in conjunction with 
the accompanying drawings, in which: 

FIG. 1 is a schematic block diagram of a simplified 
French to English translation system; 

FIG. 2 is a schematic block diagram of a simplified 
French to English translation system which incorporates 
analysis and synthesis; 

FIG. 3 is a schematic block diagram illustrating a manner 
in which statistical transfer can be incorporated into a 
translation system based on an interlingua; 

FIG. 4 is a schematic flow diagram of a basic system 
operating in batch mode; 

FIG. 5 is a schematic flow diagram of a basic system 
operating in user-aided mode; 






FIG. 6 is a schematic representation of a work station 
environment through which a user interfaces with a user- 
aided system; 

FIG. 7 is a schematic flow diagram of a translation 
component of a batch system; 

FIG. 8 is a schematic flow diagram illustrating a user's 
interaction with a human-aided system; 

FIG. 9 is a schematic flow diagram of a translation 
component of a human-aided system; 

FIG. 10 is a schematic block diagram of the basic hardware 
components used in a preferred embodiment of the present 
invention; 

FIG. 11 is a schematic flow diagram of a source transducer; 

FIG. 12 is a schematic flow diagram of a token-to-word 
transducer; 

FIG. 13 is a schematic flow diagram of a syntactic 
transducer for English; 

FIG. 14 is a schematic flow diagram of a syntactic 
transducer for French; 

FIG. 15 is an example of a syntactic transduction; 

FIG. 16 is a schematic flow diagram illustrating the 
operation of a finite-state transducer; 

FIG. 17 is a Table of Patterns for a finite-state pattern 
matcher employed in some source and target transducers; 

FIG. 18 is a Table of Actions for an action processing 
module employed in some source and target transducers; 

FIG. 19 is a schematic flow diagram of a target transducer; 

FIG. 20 is a schematic block diagram of a target structure 
language model; 

FIG. 21 is a schematic block diagram of a target structure 
to source structure translation model; 

FIG. 22 is a simple example of an alignment between a 
source structure and a target structure; 

FIG. 23 is another example of an alignment; 

FIG. 24 is a schematic flow diagram showing an 
embodiment of a target structure to source structure trans- 
lation model; 

FIG. 25 is a schematic block diagram of a detailed 
translation model; 

FIG. 26 is a schematic flow diagram of an embodiment of 
a detailed translation model; 

FIG. 27 is a schematic flow diagram of a method for 
estimating parameters for a progression of gradually more 
sophisticated translation models; 

FIG. 28 is a schematic flow diagram of a method for 
iteratively improving parameter estimates for a model; 

FIG. 29 is a schematic flow diagram of a method for using 
parameter values for one model to obtain good initial 
estimates of parameters for another model; 

FIG. 30 shows some sample subtrees from a tree con- 
structed using a clustering method; 

FIG. 31 is a schematic block diagram of a method for 
partitioning a vocabulary into classes; 

FIG. 32 is a schematic flow diagram of a method for 
partitioning a vocabulary into classes; 

FIG. 33 is an alignment with independent English words; 

FIG. 34 is an alignment with independent French words; 

FIG. 35 is a general alignment; 

FIG. 36 shows two tableaux for one alignment; 

FIG. 37 shows the best alignment out of 1.9×10^23 alignments; 

FIG. 38 shows the best of 8.4×10^29 possible alignments; 

FIG. 39 shows the best of 5.6×10^31 alignments; 

FIG. 40 is an example of informants and informant sites; 

FIG. 41 is a schematic flow diagram illustrating the 
operation of a sense-labeling transducer; 

FIG. 42 is a schematic flow diagram of a module that 
determines good questions about informants for each 
vocabulary word; 

FIG. 43 is a schematic flow diagram of a module that 
determines a good question about each informant of a 
vocabulary word; 

FIG. 44 is a schematic flow diagram of a method for 
determining a good question about an informant; 

FIG. 45 is a schematic diagram of a process by which 
word-by-word correspondences are extracted from a bilin- 
gual corpus; 

FIG. 46 is a schematic flow diagram of a method for 
aligning sentences in parallel corpora; 

FIG. 47 is a schematic flow diagram of the basic step of 
a method for aligning sentences; 

FIG. 48 depicts a sample of text before and after textual 
cleanup and sentence detection; 

FIG. 49 shows an example of a division of aligned corpora 
into beads; 

FIG. 50 shows a finite state model for generating beads; 

FIG. 51 shows the beads that are allowed by the model; 

FIG. 52 is a histogram of French sentence lengths; 

FIG. 53 is a histogram of English sentence lengths; 

FIG. 54 is a schematic flow diagram of the hypothesis 
search component of the system; 

FIG. 55 is an example of a source sentence being trans- 
duced to a sequence of morphemes; 

FIG. 56 is an example of a partial hypothesis which 
results from an extension by the target morpheme the; 

FIG. 57 is an example of a partial hypothesis which 
results from an extension by the target morpheme mother; 

FIG. 58 is an example of a partial hypothesis which 
results from an extension with an open target morpheme; 

FIG. 59 is an example of a partial hypothesis which 
results from an extension in which an open morpheme is 
closed; 

FIG. 60 is an example of a partial hypothesis which 
results from an extension in which an open morpheme is 
kept open; 

FIG. 61 is an example of a partial hypothesis which 
results from an extension by the target morpheme to_wake; 

FIG. 62 is an example of a partial hypothesis which 
results from an extension by the pair of target morphemes up 
and to_wake; 

FIG. 63 is an example of a partial hypothesis which 
results from an extension by the pair of target morphemes up 
and to_wake in which to_wake is open; 

FIG. 64 is an example of a subset lattice; 

FIG. 65 is an example of a partial hypothesis that is stored 
in the priority queue {2,3}; 

FIG. 66 is a schematic flow diagram of the process by 
which partial hypotheses are selected for extension; 

FIG. 67 is a schematic flow diagram of the method by 
which the partial hypotheses on a priority queue are processed 
in the selection for extension step; 

FIG. 68 contains pseudocode describing the method for 
extending hypotheses. 






BEST MODE FOR CARRYING OUT THE INVENTION 



Contents 

1 Introduction 

2 Translation Systems 

3 Source Transducers 

3.1 Overview 

3.2 Components 

3.2.1 Tokenizing Transducers 

3.2.2 Token-Word Transducers 

3.2.3 True-Case Transducers 

3.2.4 Specialized Transformation Transducers 

3.2.5 Part-of-Speech Labeling Transducers 

3.2.6 Syntactic Transformation Transducers 

3.2.7 Syntactic Transducers for English 

3.2.8 Syntactic Transducers for French 

3.2.9 Morphological Transducers 

3.2.10 Sense-Labelling Transducers 

3.3 Source-Transducers with Constraints 

4 Finite-State Transducers 

5 Target Transducers 

5.1 Overview 

5.2 Components 

5.3 Target Transducers with Constraints 

6 Target Language Model 

6.1 Perplexity 

6.2 n-gram Language models 

6.3 Simple n-gram models 

6.4 Smoothing 

6.5 n-gram Class Models 

7 Classes 

7.1 Maximum Mutual Information Clustering 

7.2 A Clustering Method 

7.3 An Alternate Clustering Method 

7.4 Improving Classes 

7.5 Examples 

7.6 A Method for Constructing Similarity Trees 

8 Overview of Translation Models and Parameter Estimation 

8.1 Translation Models 

8.1.1 Training 

8.1.2 Hidden Alignment Models 

8.2 An example of a detailed translation model 

8.3 Iterative Improvement 

8.3.1 Relative Objective Function 

8.3.2 Improving Parameter Values 

8.3.3 Going From One Model to Another 

8.3.4 Parameter Reestimation Formulae 

8.3.5 Approximate and Viterbi Parameter Estimation 

8.4 Five Models 

9 Detailed Description of Translation Models and Parameter 
Estimation 

9.1 Notation 

9.2 Translations 

9.3 Alignments 

9.4 Model 1 

9.5 Model 2 

9.6 Intermodel interlude 

9.7 Model 3 

9.8 Deficiency 

9.9 Model 4 

9.10 Model 5 

9.11 Results 

9.12 Better Translation Models 

9.12.1 Deficiency 

9.12.2 Multi-word Notions 

10 Mathematical Summary of Translation Models 






10.1 Summary of Notation 

10.2 Model 1 

10.3 Model 2 

10.4 Model 3 

10.5 Model 4 

10.6 Model 5 

11 Sense Disambiguation 

11.1 Introduction 

11.2 Design of a Sense-Labeling Transducer 

11.3 Constructing a table of informants and questions 

11.4 Mathematics of Constructing Questions 

11.4.1 Statistical Translation with Transductions 

11.4.2 Sense-Labeling in Statistical Translation 

11.4.3 The Viterbi Approximation 

11.4.4 Cross Entropy 

11.5 Selecting Questions 

11.5.1 Source Questions 

11.5.2 Target Questions 

11.6 Generalizations 

12 Aligning Sentences 

12.1 Overview 

12.2 Tokenization and Sentence Detection 

12.3 Selecting Anchor Points 

12.4 Aligning Major Anchors 

12.5 Discarding Sections 

12.6 Aligning Sentences 

12.7 Ignoring Anchors 

12.8 Results for the Hansard Example 

13 Aligning Bilingual Corpora 

14 Hypothesis Search - Steps 702 and 902 

14.1 Overview of Hypothesis Search 

14.2 Hypothesis Extension 5404 
14.2.1 Types of Hypothesis Extension 

14.3 Selection of Hypotheses to Extend 5402 
1 Introduction 

The inventions described in this specification include 
apparatuses for translating text in a source language to text 
in a target language. The inventions employ a number of 
different components. There are various embodiments of 
each component, and various embodiments of the apparatuses 
in which the components are connected together in 
different ways. 

To introduce some of the components, a number of 
simplified embodiments of a translation apparatus will be 
described in this introductory section. The components 
themselves will be described in detail in subsequent sec- 
tions. More sophisticated embodiments of the apparatuses, 
including the preferred embodiments, will also be described 
in subsequent sections. 

The simplified embodiments translate from French to 
English. These languages are chosen only for the purpose of 
example. The simplified embodiments, as well as the preferred 
embodiments, can easily be adapted to other natural 
languages, including but not limited to Chinese, Danish, 
Dutch, English, French, German, Greek, Italian, Japanese, 
Portuguese, and Spanish, as well as to artificial languages, 
including but not limited to database query languages such 
as SQL and programming languages such as LISP. For 
example, in one embodiment queries in English might be 
translated into an artificial database query language. 

A simplified embodiment 104 is depicted in FIG. 1. It 
includes three components: 

1. A language model 101 which assigns a probability or 
score P(E) to any portion of English text E; 

2. A translation model 102 which assigns a conditional 
probability or score P(F|E) to any portion of French text 
F given any portion of English text E; and 

3. A decoder 103 which, given a portion of French text F, 
finds a number of portions of English text E, each of 
which has a large combined probability or score 

P(E,F) = P(E) P(F|E).    (1) 



Language models are described in Section 6, and trans- 
lation models are described in Sections 8, 9, and 10. An 
embodiment of the decoder 103 is described in detail in 
Section 14. 
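
The interplay of the three components can be sketched as follows. This is a simplified illustration only: it assumes that a finite list of candidate English texts is already available, whereas the decoder of Section 14 searches for candidates itself. The function `decode`, the dictionaries, and the probability values are hypothetical.

```python
def decode(french, candidates, language_model, translation_model, n_best=1):
    """Rank candidate English texts E by the combined score P(E) * P(F|E)."""
    scored = [
        (language_model(e) * translation_model(french, e), e)
        for e in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n_best]

# Hypothetical toy models, standing in for components 101 and 102.
LM = {"the dog": 0.6, "dog the": 0.1}
TM = {("le chien", "the dog"): 0.5, ("le chien", "dog the"): 0.5}

best = decode(
    "le chien",
    ["dog the", "the dog"],
    language_model=lambda e: LM.get(e, 0.0),
    translation_model=lambda f, e: TM.get((f, e), 0.0),
)
```

In this toy case both candidates are equally good translations under the translation model, and the language model breaks the tie in favor of the well-formed English word order.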

A shortcoming of the simplified architecture depicted in 
FIG. 1 is that the language model 101 and the translation 
model 102 deal directly with unanalyzed text. The linguistic 
information in a portion of text and the relationships 
between translated portions of text are complex, involving 
linguistic phenomena of a global nature. However, the 
models 101 and 102 must be relatively simple so that their 
parameters can be reliably estimated from a manageable 
amount of training data. In particular, they are restricted to 
the description of local structure. 

This difficulty is addressed by integrating the architecture 
of FIG. 1 with the traditional machine translation architec- 
ture of analysis, transfer, and synthesis. The result is the 
simplified embodiment depicted in FIG. 2. It includes six 
components: 

1. a French transduction component 201 in which a 
portion of French text F is encoded into an intermediate 
structure F'; 

2. a statistical transfer component 202 in which F' is 
translated to a corresponding intermediate English 
structure E'; 

3. an English transduction component 203 in which a 
portion of English text E is reconstructed from E'; 

This statistical transfer component 202 is similar in func- 
tion to the whole of the simplified embodiment 104, except 
that it translates between intermediate structures instead of 
translating directly between portions of text. The building 
blocks of these intermediate structures may be words, mor- 
phemes, syntactic markers, semantic markers, or other units 
of linguistic structure. In descriptions of aspects of the 
statistical transfer component, for reasons of exposition, 
these building blocks will sometimes be referred to as words 
or morphemes, morphs, or morphological units. It should be 
understood, however, that unless stated otherwise the 
same descriptions of the transfer component apply also to 
other units of linguistic structure. 

The statistical transfer component incorporates 

4. an English structure language model 204 which assigns 
a probability or score P(E') to any intermediate 
structure E'; 

5. an English structure to French structure translation 
model 205 which assigns a conditional probability or 
score P(F'|E') to any intermediate structure F' given any 
intermediate structure E'; and 

6. a decoder 206 which, given a French structure F', finds 
a number of English structures E', each of which has a 
large combined probability or score 

P(E',F') = P(E') P(F'|E').    (2) 



The role of the French transduction component 201 and 
the English transduction component 203 is to facilitate the 
task of the statistical transfer component 202. The intermediate 
structures E' and F' encode global linguistic facts about 
E and F in a local form. As a result, the language and 
translation models 204 and 205 for E' and F' will be more 
accurate than the corresponding models for E and F. 






The French transduction component 201 and the English 
transduction component 203 can be composed of a sequence 
of successive transformations in which F' is incrementally 
constructed from F, and E is incrementally recovered from 
E'. For example, one transformation of the French transduction 
component 201 might label each word of F with the part 
of speech, e.g. noun, verb, or adjective, that it assumes in F. 
The part of speech assumed by a word depends on other 
words of F, but in F' it is encoded locally in the label. 
Another transformation might label each word of F with a 
cross-lingual sense label. The label of a word is designed to 
elucidate the English translation of the word. 

Various embodiments of French transduction components 
are described in Section 3, and various embodiments of 
English transduction components are described in Section 5. 

Embodiments of a cross-lingual sense labeler and meth- 
ods for sense-labeling are described in Section 11. 

FIG. 2 illustrates the manner by which statistical transfer 
is incorporated into the traditional architecture of analysis, 
transfer, and synthesis. However, in other embodiments 
statistical transfer can be incorporated into other architec- 
tures. FIG. 3, for example, shows how two separate statis- 
tical transfer components can be incorporated into a French- 
to-English translation system which uses an interlingua. The 
French transducer 302 together with the statistical transfer 
module 304 translate French text 301 into interlingual 
representations 305. The other statistical transfer module 
306 together with the English transduction module 308 
translate the interlingual representations 305 into English 
text 309. 

The language models and translation models are statistical 
models with a large number of parameters. The values for 
these parameters are determined in a process called training 
from a large bilingual corpus consisting of parallel English- 
French text. For the training of the translation model, the 
English and French text is first aligned on a sentence by 
sentence basis. The process of aligning sentences is 
described in Sections 12 and 13. 
2 Translation Systems 

FIG. 4 depicts schematically a preferred embodiment of a 
batch translation system 401 which translates text without 
user assistance. The operation of the system involves three 
basic steps. In the first step 403 a portion of source text is 
captured from the input source document to be translated. In 
the second step 404, the portion of source text captured in 
step 403 is translated into the target language. In the third 
step 405, the translated target text is displayed, made available 
to at least one of a listening and viewing operation, or 
otherwise made available. If at the end of these three steps 
the source document has not been fully translated, the 
system returns to step 403 to capture the next untranslated 
portion of source text. 

FIG. 5 depicts schematically a preferred embodiment of a 
user-aided translation system 502 which assists a user 503 in 
the translation of text. The system is similar to the batch 
translation system 401 depicted in FIG. 4. Step 505 measures, 
receives, or otherwise captures a portion of source text with 
possible input from the user. Step 506 displays the text 
directly to the user and may also output it to an electronic 
medium, make it available to at least one of a listening and 
viewing operation, or make it available elsewhere. In the 
user-aided system 502 the step 404 of the batch system 401 is 
replaced by step 501 in which source text is translated into 
the target language with input from the user 503. In the embodiment 
depicted in FIG. 5, the system 502 is accompanied by a 
user-interface 504 which includes a text editor or the like, 
and which makes available to the user translation dictionaries, 
thesauruses, and other information that may assist the 
user 503 in selecting translations and specifying constraints. 
The user can select and further edit the translations produced 
by the system. 

In step 505 of the user-aided system in FIG. 5, portions of 
source text may be selected by the user. The user might, for 
example, select a whole document, a sentence, or a single 
word. The system might then show the user possible 
translations of the selected portion of text. For example, if 
the user selected only a single word the system might show 
a ranked list of possible translations of that word, the ranks 
being determined by statistical models that would be used to 
estimate the probabilities that the source word translates in 
various manners in the source context in which the source 
word appears. 

In step 506 in FIG. 5 the text may be displayed for the user 
in a variety of fashions. A list of possible translations may 
be displayed in a window on a screen, for example. Another 
possibility is that possible target translations of source words 
may be positioned above or under the source words as they 
appear in the context of the source text. In this way the 
source text could be annotated with possible target transla- 
tions. The user could then read the source text with the help 
of the possible translations as well as pick and choose 
various translations. 

A schematic representation of a work station environment 
through which a user interfaces with a user-aided 
system is shown in FIG. 6. 

FIG. 7 depicts in more detail the step 404 of the batch 
translation system 401 depicted in FIG. 4. The step 404 is 
expanded into four steps. In the first step 701 the input 
source text 707 is transduced to one or more intermediate 
source structures. In the second step 702 a set of one or more 
hypothesized target structures are generated. This step 702 
makes use of a language model 705 which assigns prob- 
abilities or scores to target structures and a translation model 
706 which assigns probabilities or scores to source struc- 
tures given target structures. In the third step 703 the highest 
scoring hypothesis is selected. In the fourth step 704 the 
hypothesis selected in step 703 is transduced into text in the 
target language 708. 
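
The four steps of FIG. 7 can be read as a simple pipeline, sketched below. The function `translate_batch` and its four callable arguments are hypothetical stand-ins for steps 701 through 704. For brevity the sketch separates scoring from hypothesis generation, whereas in the text step 702 already makes use of the language model 705 and the translation model 706.

```python
def translate_batch(source_text, transduce_source, hypothesize, score,
                    transduce_target):
    """Run the four steps of FIG. 7 on one portion of source text."""
    source_structures = transduce_source(source_text)   # step 701
    hypotheses = hypothesize(source_structures)         # step 702
    best = max(hypotheses,                               # step 703
               key=lambda h: score(h, source_structures))
    return transduce_target(best)                        # step 704

# Toy stand-ins, for illustration only.
result = translate_batch(
    "le chien",
    transduce_source=lambda text: text.lower().split(),
    hypothesize=lambda src: [["the", "dog"], ["dog", "the"]],
    score=lambda hyp, src: 0.3 if hyp == ["the", "dog"] else 0.05,
    transduce_target=lambda hyp: " ".join(hyp),
)
```

Passing the stages in as callables mirrors the way the specification allows each step to have several embodiments.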

FIG. 8 depicts in more detail the step 501 of the user-aided 
translation system 502 depicted in FIG. 5 (the user interface 
has been omitted for simplification). The step 501 is 
expanded into four steps. In the first step 801 the user 503 
is asked to supply zero or more translation constraints or 
hints. These constraints are used in subsequent steps of the 
translation process. 

In the second step 802 a set of putative translations of the 
source text is produced. The word putative is used here, 
because the system will produce a number of possible 
translations of a given portion of text. Some of these will be 
legitimate translations only in particular circumstances or 
domains, and others may simply be erroneous. In the third 
step 803 these translations are displayed or otherwise made 
available to the user. In the fourth step 804, the user 503 is 
asked either to select one of the putative translations produced 
in step 802, or to specify further constraints. If the 
user 503 chooses to specify further constraints the system 
returns to the first step 801. Otherwise the translation 
selected by the user 503 is displayed or otherwise made 
available for viewing or the like. 

FIG. 9 depicts in more detail the step 802 of FIG. 8 in 
which a set of putative translations is generated. The step 
802 is expanded into four steps. In the first step 901, the 
input source text is transduced to one or more intermediate 
source structures. This step is similar to step 701 of FIG. 7 
except that here the transduction takes place in a manner 
which is consistent with the constraints 906 which have been 
provided by the user 503. In the second step 902, a set of one 
or more hypotheses of target structures is generated. This 
step is similar to step 702 except that here only hypotheses 
consistent with the constraints 906 are generated. In the third 
step 903 a set of the highest scoring hypotheses is selected. 
In the fourth step 904 each of the hypotheses selected in step 
903 is transduced into text in the target language. The 
sections of text in the target language which result from this 
step are the output of this module. 

A wide variety of constraints may be specified in step 801 
of FIG. 8 and used in the method depicted in FIG. 9. For 
example, the user 503 may provide a constraint which 
affects the generating of target hypotheses in step 902 by 
insisting that a certain source word be translated in a 
particular fashion. As another example, he may provide a 
constraint which affects the transduction of source text into 
an intermediate structure in step 901 by hinting that a 
particular ambiguous source word should be assigned a 
particular part of speech. As other examples, the user 503 
may a) insist that the word 'transaction' appear in the target 
text; b) insist that the word 'transaction' not appear in the 
target text; c) insist that the target text be expressed as a 
question, or d) insist that the source word 'avoir' be translated 
differently from the way it was translated in the fifth 
putative translation produced by the system. In other 
embodiments which transduce source text into parse trees, 
the user 503 may insist that a particular sequence of source 
words be syntactically analyzed as a noun phrase. These are 
just a few of many similar types of examples. In embodiments 
in which case frames are used in the intermediate 
source representation, the user 503 may insist that a particular 
noun phrase fill the donor slot of the verb 'to give'. 
These constraints can be obtained from a simple user 
interface, which someone skilled in the art of developing 
interfaces between users and computers can easily develop. 
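
One way to picture constraints of types a) and b) is as predicates applied to hypothesized target structures, as in the sketch below. The helper names are hypothetical. The sketch filters hypotheses after they have been generated, which is a simplification: in step 902 as described, only hypotheses consistent with the constraints are generated in the first place.

```python
def must_contain(word):
    """Constraint: the word must appear in the target text."""
    return lambda target_words: word in target_words

def must_not_contain(word):
    """Constraint: the word must not appear in the target text."""
    return lambda target_words: word not in target_words

def filter_hypotheses(hypotheses, constraints):
    """Keep only the hypotheses that satisfy every user constraint."""
    return [h for h in hypotheses if all(c(h) for c in constraints)]

hypotheses = [
    ["the", "transaction", "failed"],
    ["the", "deal", "failed"],
]
with_word = filter_hypotheses(hypotheses, [must_contain("transaction")])
without_word = filter_hypotheses(hypotheses, [must_not_contain("transaction")])
```

An empty constraint list leaves every hypothesis in place, which matches the case in which the user supplies zero constraints.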

Steps 702 of FIG. 7 and 902 of FIG. 9 both hypothesize 
intermediate target structures from intermediate source 
structures produced in previous steps. 

In the embodiments described here, the intermediate 
source and target structures are sequences of morphological 
units, such as roots of verbs, tense markers, prepositions and 
so forth. In other embodiments these structures may be more 
complicated. For example, they may comprise parse trees or 
case-frame representation structures. In a default condition, 
these structures may be identical to the original texts. In such 
a default condition, the transduction steps are not necessary 
and can be omitted. 

In the embodiments described here, text in the target 
language is obtained by transducing intermediate target 
structures that have been hypothesized in previous steps. In 
other embodiments, text in the target language can be 
hypothesized directly from source structures. In these other 
embodiments steps 704 and 904 are not necessary and can 
be omitted. 

An example of a workstation environment depicting a 
hardware implementation using the translation system 
(application) in connection with the present invention is 
shown in FIG. 10. A computer platform 1014 includes a 
hardware unit 1004, which includes a central processing unit 
(CPU) 1006, a random access memory (RAM) 1005, and an 
input/output interface 1007. The RAM 1005 is also called 
main memory. 

The computer platform 1014 typically includes an oper- 
ating system 1003. A data storage device 1002 is also called 
secondary storage and may include hard disks and/or tape 
drives and their equivalents. The data storage device 1002 
represents non-volatile storage. The data storage 1002 may 
be used to store data for the language and translation model 
components of the translation system 1001. 






Various peripheral components may be connected to the 
computer platform 1014, such as a terminal 1012, a micro- 
phone 1008, a keyboard 1013, a scanning device 1009, an 
external network 1010, and a printing device 1011. A user 
503 may interact with the translation system 1001 using the 
terminal 1012 and the keyboard 1013, or the microphone 
1008, for example. As another example, the user 503 might 
receive a document from the external network 1010, trans- 
late it into another language using the translation system 
1001, and then send the translation out on the external 
network 1010. 

In a preferred embodiment of the present invention, the 
computer platform 1014 includes an IBM System RS/6000 
Model 550 workstation with at least 128 Megabytes of 
RAM. The operating system 1003 which runs thereon is 
IBM AIX (Advanced Interactive Executive) 3.1. Many 
equivalents to this structure will readily become apparent 
to those skilled in the art. 

The translation system can receive source text in a variety 
of known manners. The following are only a few examples 
of how the translation system receives source text (e.g. data), 
to be translated. Source text to be translated may be directly 
entered into the computer system via the keyboard 1013. 
Alternatively, the source text could be scanned in using the 
scanner 1009. The scanned data could be passed through a 
character recognizer in a known manner and then passed on 
to the translation system for translation. Alternatively, the 
user 503 could identify the location of the source text in 
main or secondary storage, or perhaps on removable secondary 
storage (such as on a floppy disk), and the computer 
system could retrieve and then translate the text accordingly. 
As a final example, with the addition of a speech recognition 
component, it would also be possible to speak into the 
microphone 1008 and have the speech converted into source 
text by the speech recognition component. 

Translated target text produced by the translation appli- 
cation running on the computer system may be output by the 
system in a variety of known manners. For example, it may 
be displayed on the terminal 1012, stored in RAM 1005, 
stored in data storage 1002, printed on the printer 1011, or 
perhaps transmitted out over the external network 1010. 
With the addition of a speech synthesizer it would also be 
possible to convert translated target text into speech in the 
target language. 

Step 403 in FIG. 4 and step 505 in FIG. 5 measure, receive 
or otherwise capture a portion of source text to be translated. 
In the context of this invention, text is used to refer to 
sequences of characters, formatting codes, and typographical 
marks. It can be provided to the system in a number of 
different fashions, such as on a magnetic disk, via a computer 
network, as the output of an optical scanner, or of a 
speech recognition system. In some preferred embodiments, 
the source text is captured a sentence at a time. Source text 
is parsed into sentences using a finite-state machine which 
examines the text for such things as upper and lower case 
characters and sentence terminal punctuation. Such a 
machine can easily be constructed by someone skilled in the 
art. In other embodiments, text may be parsed into units such 
as phrases or paragraphs which are either smaller or larger 
than individual sentences. 
3 Source Transducers 

In this section, some embodiments of the source-trans- 
ducer 701 will be explained. The role of this transducer is to 
produce one or more intermediate source-structure repre- 
sentations from a portion of text in the source language. 




3.1 Overview 

An embodiment of the source-transducer 701 is shown in
FIG. 11. In this embodiment, the transducer takes as input a
sentence in the source language and produces a single
intermediate source-structure consisting of a sequence of
linguistic morphs. This embodiment of the source-transducer
701 comprises transducers that:

tokenize raw text 1101; 

determine words from tokens 1102; 

annotate words with parts of speech 1103;

perform syntactic transformations 1104; 

perform morphological transformations 1105; 

annotate morphs with sense labels 1106. 

It should be understood that FIG. 11 represents only one
embodiment of the source-transducer 701. Many variations
are possible. For example, the transducers 1101, 1102, 1103,
1104, 1105, 1106 may be augmented and/or replaced by
other transducers. Other embodiments of the source-transducer
701 may include a transducer that groups words into
compound words or identifies idioms. In other embodiments,
rather than a single intermediate source-structure
being produced for each source sentence, a set of several
intermediate source-structures together with probabilities or
scores may be produced. In such embodiments the transducers
depicted in FIG. 11 can be replaced by transducers
which produce at each stage intermediate structures with
probabilities or scores. In addition, the intermediate source-structures
produced may be different. For example, the
intermediate structures may be entire parse trees, or case
frames for the sentence, rather than a sequence of morphological
units. In these cases, there may be more than one
intermediate source-structure for each sentence with scores,
or there may be only a single intermediate source-structure.
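
The chaining of transducers described in this overview can be pictured as simple function composition. The following is a minimal sketch, assuming each stage is a function from one intermediate structure to the next; the names compose, tokenize_stub and lowercase_stub are hypothetical stand-ins and are not taken from the specification.

```python
from typing import Callable, List, Sequence

Stage = Callable[[object], object]


def compose(stages: Sequence[Stage]) -> Stage:
    """Return a transducer that applies each stage in order."""
    def run(structure):
        for stage in stages:
            structure = stage(structure)
        return structure
    return run


# Hypothetical stand-ins for two early stages; real stages would be far richer.
def tokenize_stub(text: str) -> List[str]:
    return text.split()


def lowercase_stub(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


source_transducer = compose([tokenize_stub, lowercase_stub])
print(source_transducer("The cat"))
```

An embodiment that carries probabilities or scores could, under the same assumption, have each stage accept and return a list of (structure, score) pairs instead of a single structure.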

3.2 Components 

Referring still to FIG. 11, the transducers comprising the
source-transducer 701 will be explained. For concreteness, 
these transducers will be discussed in cases in which the 
source language is either English or French. 
3.2.1 Tokenizing Transducers 

The purpose of the first transducer 1101, which tokenizes
raw text, is well illustrated by the following Socratic dialogue:

How do you find words in text?
    Words occur between spaces.
What about "however,"? Is that one word or two?
    Oh well, you have to separate out the commas.
Periods too?
    Of course.
What about "Mr."?
    Certain abbreviations have to be handled separately.
How about "shouldn't"? One word or two?
    One.
So "shouldn't" is different from "should not"?
    Yes.
And "Gauss-Bonnet", as in the "Gauss-Bonnet Theorem"?
    Two names, two words.
So if you split words at hyphens, what do you do with
"vis-a-vis"?
    One word, and don't ask me why.
How about "stingray"?
    One word, of course.
And "manta ray"?
    One word: it's just like stingray.
But there's a space.
    Too bad.
How about "inasmuch"?
    Two.
Are you sure?
    No.

This dialogue illustrates that there is no canonical way of
breaking a sequence of characters into tokens. The purpose
of the transducer 1101, which tokenizes raw text, is to make
some choice.

In an embodiment in which the source language is
English, this tokenizing transducer 1101 uses a table of a few
thousand special character sequences which are parsed as
single tokens and otherwise treats spaces and hyphens as
word separators. Punctuation marks and digits are separated
off and treated as individual words. For example, 87, is
tokenized as the three words 8 7 and ,.

In another embodiment in which the source-language is 
French, the tokenizing transducer 1101 operates in a similar 
way. In addition to space and hyphen, the symbol -t- is 
treated as a separator when tokenizing French text. 
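
The English tokenizing rules just described can be sketched in a few lines. This is only an illustration under stated assumptions: the table SPECIAL_SEQUENCES is a tiny hypothetical stand-in for the table of a few thousand special sequences, apostrophes are kept inside words, and special sequences are recognized only when they make up a whole space-delimited chunk.

```python
import re

# Hypothetical stand-in for the table of "a few thousand" special sequences.
SPECIAL_SEQUENCES = {"Mr.", "Mrs.", "Dr.", "vis-a-vis"}

# Runs of letters/apostrophes, single digits, or single punctuation marks.
# Hyphens match none of the alternatives, so they act as separators.
_PIECE = re.compile(r"(?:[^\W\d_]|')+|\d|[^\w\s-]")


def tokenize(text, special=SPECIAL_SEQUENCES):
    tokens = []
    for chunk in text.split():          # spaces are word separators
        if chunk in special:            # special sequences stay whole
            tokens.append(chunk)
        else:                           # split at hyphens, digits, punctuation
            tokens.extend(_PIECE.findall(chunk))
    return tokens


print(tokenize("87,"))
print(tokenize("Mr. Gauss-Bonnet however,"))
```

A French variant could, under the same assumptions, additionally treat the symbol -t- as a separator before applying the same splitting.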

3.2.2 Token-Word Transducers

The next transducer 1102 determines a sequence of words 
from a sequence of token spellings. In some embodiments, 
this transducer comprises two transducers as depicted in 
FIG. 12. The first of these transducers 1201 determines an
underlying case pattern for each token in the sequence. The 
second of these transducers 1202 performs a number of 
specialized transformations. These embodiments can easily 
be modified or enhanced by a person skilled in the art. 

3.2.3 True-Case Transducers 

The purpose of a true-case transducer 1201 is made 
apparent by another Socratic dialogue: 

When do two sequences of characters represent the same
word?
    When they are the same sequences.
So, "the" and "The" are different words?
    Don't be ridiculous. You have to ignore differences in
    case.
So "Bill" and "bill" are the same word?
    No. "Bill" is a name and "bill" is something you pay.
    With proper names the case matters.
What about the two "May"'s in "May I pay in May?"
    The first one is not a proper name. It is capitalized
    because it is the first word in the sentence.
Then, how do you know when to ignore case and when
not to?
    If you are human, you just know.

Computers don't know when case matters and when it
doesn't. Instead, this determination can be performed by a
true-case transducer 1201. The input to this transducer is a
sequence of tokens labeled by a case pattern that specifies
the case of each letter of the token as it appears in printed
text. These case patterns are corrupted versions of true-case
patterns that specify what the casing would be in the absence
of typographical errors and arbitrary conventions (e.g., capitalization
at the beginning of sentences). The task of the
true-case transducer is to uncover the true-case patterns.

In some embodiments of the true-case transducer the case
and true-case patterns are restricted to eight possibilities:

L+   U+   UL+   ULUL+   ULLUL+   UUL+   UUUL+   LUL+

Here U denotes an upper case letter, L a lower case letter, U+
a sequence of one or more upper case letters, and L+ a
sequence of one or more lower case letters. In these embodiments,
the true-case transducer can determine a true-case of
an occurrence of a token in text using a simple method
comprising the following steps:






1. Decide whether the token is part of a name. If so, set 
the true-case equal to the most probable true-case 
beginning with a U for that token. 

2. If the token is not part of a name, then check if the token 
is a member of a set of tokens which have only one 
true-case. If so, set the true-case appropriately. 

3. If the true-case has not been determined by steps 1 or 
2, and the token is the first token in a sentence then set 
the true-case equal to the most probable true-case for 
that token. 

4. Otherwise, set the true-case equal to the case for that 
token. 
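
The four steps above amount to a short cascade of checks. The sketch below assumes the name decision and the lookup tables are supplied from outside; the parameter names (in_name, single_case, most_probable, most_probable_upper) are hypothetical, and falling back to the observed case when a token is missing from a table is an assumption, not something the text specifies.

```python
def true_case(token, observed, in_name, sentence_initial,
              single_case, most_probable, most_probable_upper):
    """Return a true-case pattern such as 'L+' or 'UL+' for one token.

    observed            -- case pattern of the token as printed
    in_name             -- True if the token was judged to be part of a name
    single_case         -- dict: token -> its only true-case pattern
    most_probable       -- dict: token -> most probable true-case pattern
    most_probable_upper -- dict: token -> most probable pattern starting with U
    """
    # Step 1: names take the most probable pattern beginning with U.
    if in_name:
        return most_probable_upper.get(token, observed)
    # Step 2: tokens known to have only one true-case.
    if token in single_case:
        return single_case[token]
    # Step 3: sentence-initial tokens take their most probable pattern.
    if sentence_initial:
        return most_probable.get(token, observed)
    # Step 4: otherwise trust the case as printed.
    return observed


tables = dict(single_case={"the": "L+"},
              most_probable={"may": "L+"},
              most_probable_upper={"bill": "UL+"})
# "May I pay in May?": first token is sentence-initial, last is not.
print(true_case("may", "UL+", False, True, **tables))
print(true_case("may", "UL+", False, False, **tables))
```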

In an embodiment of the true-case transducer used for
both English and French, names are recognized with a
simple finite-state machine. This machine employs a list of
12,937 distinct common last names and 3,717 distinct common
first names constructed from a list of 125 million full
names obtained from the IBM online phone directory and a
list of names purchased from a marketing corporation. It also
uses lists of common precursors to names (such as Mr., Mrs.,
Dr., Mlle., etc.) and common followers to names (such as Jr.,
Sr., III, etc.).

The set of tokens with only one true-case consists of all
tokens with a case-pattern entropy of less than 1.25 bits,
together with 9,506 English and 3,794 French
tokens selected by hand from the tokens of a large bilingual
corpus. The most probable case pattern for each English
token was determined by counting token-case co-occurrences
in a 67 million word English corpus, and for each French
token by counting co-occurrences in a 72 million word French
corpus. (In this counting, occurrences at the beginnings of
sentences are ignored.)
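
The entropy criterion can be made concrete with a short calculation. The sketch below assumes that case-pattern entropy means the Shannon entropy, in bits, of the empirical distribution of case patterns observed for a token; the function names and the counts in the usage lines are illustrative only.

```python
import math


def case_pattern_entropy(counts):
    """Shannon entropy (bits) of a token's case-pattern counts.

    counts -- non-empty dict: case pattern -> number of occurrences
    """
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def has_single_true_case(counts, threshold=1.25):
    """Apply the 1.25-bit threshold stated in the text."""
    return case_pattern_entropy(counts) < threshold


# Illustrative counts, not taken from any corpus.
print(case_pattern_entropy({"L+": 1, "UL+": 1}))
print(has_single_true_case({"L+": 99, "UL+": 1}))
```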

3.2.4 Specialized Transformation Transducers

Referring still to FIG. 12, the transducer 1202 performs a
few specialized transformations designed to systematize the
tokenizing process. These transformations can include, but
are not limited to, the correction of typographical errors, the
expansion of contractions, the systematic treatment of possessives,
etc. In one embodiment of this transducer for
English, contractions such as don't are expanded to do not,
and possessives such as John's and nurses' are expanded to
John 's and nurses '. In one embodiment of this transducer
for French, sequences such as s'il, qu'avez, and j'adore are
converted to pairs of tokens such as si il, que avez, and je
adore. In addition, a few thousand sequences such as afin de
are contracted to strings such as afin_de. These sequences
are obtained from a list compiled by a native French speaker
who felt that such sequences should be treated as individual
words. Also the four strings au, aux, du, and des are
expanded to à le, à les, de le, and de les respectively.
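
A few of these transformations can be sketched as table lookups. The tables below are tiny hypothetical samples containing only the cases named in the text, tokens are assumed to be lower case except for names, and the expansion of au and aux is written with the accented à.

```python
# Illustrative tables; a real embodiment would use much larger lists.
EN_CONTRACTIONS = {"don't": ["do", "not"]}
FR_ELISIONS = {"s'": "si", "qu'": "que", "j'": "je"}
FR_EXPANSIONS = {"au": ["à", "le"], "aux": ["à", "les"],
                 "du": ["de", "le"], "des": ["de", "les"]}


def expand_english(token):
    """Expand contractions and split off possessive markers."""
    if token in EN_CONTRACTIONS:
        return list(EN_CONTRACTIONS[token])
    if token.endswith("'s"):            # John's -> John 's
        return [token[:-2], "'s"]
    if token.endswith("s'"):            # nurses' -> nurses '
        return [token[:-1], "'"]
    return [token]


def expand_french(token):
    """Undo elisions and expand au, aux, du, des."""
    if token in FR_EXPANSIONS:
        return list(FR_EXPANSIONS[token])
    for prefix, full in FR_ELISIONS.items():
        if token.startswith(prefix) and len(token) > len(prefix):
            return [full, token[len(prefix):]]
    return [token]


print(expand_english("nurses'"), expand_french("qu'avez"))
```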

3.2.5 Part-of-Speech Labeling Transducers

Referring again to FIG. 11, the transducer 1103 annotates
words with part-of-speech labels. These labels are used by
the subsequent transducers depicted in the figure. In some
embodiments of transducer 1103, part-of-speech labels are
assigned to a word sequence using a technique based on
hidden Markov models. A word sequence is assigned the
most probable part-of-speech sequence according to a statistical
model, the parameters of which are estimated from
large annotated texts and other even larger un-annotated
texts. The technique is fully explained in the article by Bernard
Merialdo entitled "Tagging Text with a Probabilistic Model"
in the Proceedings of the International Conference on
Acoustics, Speech, and Signal Processing, May 14-17,
1991. This article is incorporated by reference herein.
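
Finding the most probable tag sequence under a hidden Markov model is commonly done with Viterbi decoding. The sketch below is a generic illustration of that standard algorithm, not the procedure of the cited article; the two-tag model and every probability in it are invented purely so the example can run.

```python
def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for a non-empty word list."""
    # best[i][t] = (probability of best path ending in tag t, previous tag)
    best = [{t: (start_p[t] * emit_p[t].get(words[0], 0.0), None)
             for t in tags}]
    for word in words[1:]:
        prev = best[-1]
        column = {}
        for t in tags:
            p, back = max(((prev[s][0] * trans_p[s][t], s) for s in tags),
                          key=lambda pair: pair[0])
            column[t] = (p * emit_p[t].get(word, 0.0), back)
        best.append(column)
    # Trace back from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]


# Toy model with invented probabilities.
TAGS = ["N", "V"]
START = {"N": 0.8, "V": 0.2}
TRANS = {"N": {"N": 0.3, "V": 0.7}, "V": {"N": 0.6, "V": 0.4}}
EMIT = {"N": {"fish": 0.6, "swim": 0.1}, "V": {"fish": 0.3, "swim": 0.7}}
print(viterbi(["fish", "swim"], TAGS, START, TRANS, EMIT))
```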






In an embodiment of the transducer 1103 for tagging of 
English, a tag set consisting of 163 parts of speech is used. 
A rough categorization of these parts of speech is given in 
Table 1. 

In an embodiment of this transducer 1103 for the tagging 
of French, a tag set consisting of 157 parts of speech is used. 
A rough categorization of these parts of speech is given in 
Table 2. 

TABLE 1

Parts of Speech for English

29 Nouns
27 Verbs
20 Pronouns
17 Determiners
16 Adverbs
12 Punctuation
10 Conjunctions
8 Adjectives
4 Prepositions
20 Other



TABLE 2

Parts of Speech for French

105 Pronouns
26 Verbs
18 Auxiliaries
12 Determiners
7 Nouns
4 Adjectives
4 Adverbs
4 Conjunctions
2 Prepositions
2 Punctuation
12 Other



3.2.6 Syntactic Transformation Transducers 

Referring still to FIG. 11, the transducer 1104, which 
performs syntactic transformations, will be described. 

One function of this transducer is to simplify the subsequent
morphological transformations 1105. For example, in
morphological transformations, verbs may be analyzed into
a morphological unit designating accidence followed by
another unit designating the root of the verb. Thus in French,
the verb ira might be replaced by 3s_future_indicative
aller, indicating that ira is the third person singular of the
future tense of aller. The same kind of transformation can be
performed in English by replacing the sequence will go by
future_indicative to_go. Unfortunately, often the two
words in such a sequence are separated by intervening words
as in the sentences:

will he go play in the traffic?

he will not go to church.

Similarly, in French the third person of the English verb
went is expressed by the two words est allé, and these two
words can be separated by intervening words as in the
sentences:

est-t-il allé?

Il n'est pas allé.

It is possible to analyze such verbs morphologically with
simple string replacement rules if various syntactic transformations
that move away intervening words are performed
first.

A second function of the syntactic transducer 1104 is to
make the task presented to the statistical models which
generate intermediate target-structures for an intermediate
source-structure as easy as possible. This is done by performing
transformations that make the forms of these structures
more similar. For example, suppose the source language
is English and the target language is French. English
adjectives typically precede the nouns they modify whereas
French adjectives typically follow them. To remove this
difference, the syntactic transducer 1104 includes a transducer
which moves French words labeled as adjectives to
positions preceding the nouns which they modify.

These transducers only deal with the most rudimentary 
linguistic phenomena. Inadequacies and systematic prob- 
lems with the transformations are overcome by the statistical 
models used later in the target hypothesis-generating module 
702 of the invention. It should be understood that in other 
embodiments of the invention more sophisticated schemes 
for syntactic transformations with different functions can be 
used. 

The syntactic transformations performed by the trans- 
ducer 1104 are performed in a series of steps. A sequence of 
words which has been annotated with parts of speech is fed 
into the first transducer. This transducer outputs a sequence 
of words and a sequence of parts of speech which together 
serve as input to the second transducer in the series, and so 
forth. The word and part-of-speech sequences produced by 
the final transducer are the input to the morphological 
analysis transducer. 

3.2.7 Syntactic Transducers for English 

FIG. 13 depicts an embodiment of a syntactic transducer
1104 for English. Although in much of this document
examples are described in which the source language is
French and the target language is English, here, for reasons
of exposition, an example of a source transducer for a source
language of English is provided. In the next subsection,
another example in which the source language is French is
provided. Those with a basic knowledge of linguistics for
other languages will be able to construct similar syntactic
transducers for those languages.

The syntactic transducer in FIG. 13 comprises three
transducers that

perform question inversion 1301; 

perform do-not coalescence 1302; 

perform adverb movement 1303. 

To understand the function of the transducer 1301 that 
performs question inversion, note that in English questions 
the first auxiliary verb is often separated by a noun phrase 
from the root of the verb as in the sentences: 

does that elephant eat? 

which car is he driving? 
The transducer 1301 inverts the auxiliary verb with the
subject of the sentence, and then converts the question mark
to a special QINV marker to signal that this inversion has
occurred. For example, the two sentences above are converted
by this transducer to:

that elephant eats QINV 

which car he is driving QINV 
This transducer also removes supporting do's as illustrated 
in the first sentence. 

To understand the function of the transducer 1302 that
performs do-not coalescence, note that in English, negation
requires the presence of an auxiliary verb. When one doesn't
exist, an inflection of to do is used. The transducer 1302
coalesces the form of to do with not into the string do_not.
For example,

John does not like turnips.
→ John do_not like turnips.

The part of speech assigned to the main verb, like above, is
modified by this transducer to record the tense and person of
to do in the input. When adverbs intervene in emphatic
sentences the do_not is positioned after the intervening
adverbs:

John does really not like turnips.
→ John really do_not like turnips.

To understand the function of the final transducer 1303
that performs adverb movement, note that in English,
adverbs often intervene between a verb and its auxiliaries.
The transducer 1303 moves adverbs which intervene
between a verb and its auxiliaries to positions following the
verb. The transducer appends a number onto the adverbs it
moves to record the positions from which they are moved.
For example,

Iraq will probably not be completely balkanized.
→ Iraq will_be_en to_balkanize probably_M2 not_M2 completely_M3.

An M2 is appended to both probably and to not to indicate
that they originally preceded the second word in the verbal
sequence will be balkanized. Similarly, an M3 is appended
to completely to indicate that it preceded the third word in
the verbal sequence.

The transducer 1303 also moves adverbs that precede
verbal sequences to positions following those sequences:

John really eats like a hog.
→ John eats really_M1 like a hog.

This is done in order to place verbs close to their subjects.

3.2.8 Syntactic Transducers for French

Referring now to FIG. 14, an embodiment of the syntactic
transducer 1104 for French is described. This embodiment
comprises four transducers that

perform question inversion 1401;

perform discontinuous negative coalescence 1402;

perform pronoun movement 1403;

perform adverb and adjective movement 1404.

The question inversion transducer 1401 undoes French
question inversion much the same way that the transducer
1301 undoes English question inversion:

mangez-vous des légumes?
→ vous mangez des légumes QINV

où habite-t-il?
→ où il habite QINV

le lui avez-vous donné?
→ vous le lui avez donné QINV

This transducer 1401 also modifies French est-ce que questions:

est-ce qu'il mange comme un cochon?
→ il mange comme un cochon EST-CE_QUE






To understand the function of the transducer 1402, which
performs negative coalescence, note that propositions in
French are typically negated or restricted by a pair of words,
ne and some other word, that surround the first word in a
verbal sequence. The transducer 1402 moves the ne next to
its mate, and then coalesces the two into a new word:

je ne sais pas.
→ je sais ne_pas.

Jean n'a jamais mangé comme un cochon.
→ Jean a ne_jamais mangé comme un cochon.

Il n'y en a plus.
→ il y en a ne_plus.
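
Negative coalescence of this kind can be sketched on a token list. The sketch assumes the elided n' has already been expanded to ne by an earlier step, handles at most one ne per sentence, and uses a small hypothetical set MATES of second elements; none of these choices is stated in the text.

```python
# Hypothetical sample of words that can pair with "ne".
MATES = {"pas", "jamais", "plus", "rien", "personne", "que"}


def coalesce_negation(tokens):
    """Remove the first 'ne' and fuse it onto its mate as 'ne_<mate>'."""
    out = list(tokens)
    if "ne" not in out:
        return out
    i = out.index("ne")
    for j in range(i + 1, len(out)):
        if out[j] in MATES:
            out[j] = "ne_" + out[j]     # coalesce into a single new word
            del out[i]                  # the original "ne" disappears
            return out
    return out                          # no mate found: leave unchanged


print(coalesce_negation(["je", "ne", "sais", "pas", "."]))
print(coalesce_negation(["il", "ne", "y", "en", "a", "plus", "."]))
```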






To understand the function of the transducer 1403, which
performs pronoun movement, note that in French, direct-object,
indirect-object and reflexive pronouns usually precede
verbal sequences. The transducer 1403 moves these
pronouns to positions following these sequences. It also
maps these pronouns to new words that reflect their roles as
direct-object, indirect-object, or reflexive pronouns. So,
for example, in the following sentence le is converted to
le_DPRO because it functions as a direct object and vous to
vous_IPRO because it functions as an indirect object:

je vous le donnerai.
→ je donnerai le_DPRO vous_IPRO.

In the next sentence, vous is tagged as a reflexive pronoun
and therefore converted to vous_RPRO:

vous vous lavez les mains.
→ vous lavez vous_RPRO les mains.

The allative pronominal clitic y and the ablative pronominal
clitic en are mapped to the two-word tuples à y_PRO and
de en_PRO:

je y penserai.
→ je penserai à y_PRO.

j'en ai plus.
→ je ai plus de en_PRO.



The final transducer 1404 moves adverbs to positions
following the verbal sequences in which they occur. It also
moves adjectives to positions preceding the nouns they
modify. This is a useful step in embodiments of the present
invention that translate from French to English, since adjectives
typically precede the noun in English.

3.2.9 Morphological Transducer

Referring again to FIG. 11, a transducer 1105, which
performs morphological transformations, will be described.
One purpose of this transducer is to make manifest in the
intermediate source-structure representation the fraternity of
different forms of a word. This is useful because it allows for
more accurate statistical models of the translation process.
For example, a system that translates from French to English
but does not use a morphological transducer cannot benefit
from the fact that sentences in which parle is translated as
speaks provide evidence that parlé should be translated as
spoken. As a result, parameter estimates for rare words are
inaccurate even when estimated from a very large training
sample. For example, even in a sample from the Canadian
Parliament of nearly 30 million words of French text, only 24
of the 35 different spellings of single-word inflections of the
verb parler actually occurred.

A morphological transducer 1105 is designed to ameliorate
such problems. The output of this transducer is a
sequence of lexical morphemes. These lexical morphemes
will sometimes be referred to in this application as morphological
units or simply morphs. In an embodiment of transducer
1105 used for English, inflectional morphological transformations
are performed that make evident common origins
of different conjugations of the same verb; the singular and
plural forms of the same noun; and the comparative and
superlative forms of adjectives and adverbs. In an embodiment
of transducer 1105 used for French, morphological
inflectional transformations are performed that make manifest
the relationship between conjugations of the same verb,
and between forms of the same noun or adjective differing in
gender and number. These morphological transformations
are reflected in the sequence of lexical morphemes
produced. The examples below illustrate the level of detail
in these embodiments of a morphological transducer 1105:

he was eating the peas more quickly than I.
→ he V_past_progressive to_eat the pea N_PLURAL quick er_ADV than I.

nous en mangeons rarement.
→ nous V_present_indicative_1p manger rare ment_ADV de en_PRO

ils se sont lavés les mains sales.
→ ils V_past_3p laver se_RPRO les sale main N_PLURAL

3.2.10 Sense-Labelling Transducers

Referring again to FIG. 11, the transducer 1106, which
annotates a lexical morph sequence produced by the transducer
1105 with sense labels, will be explained.
Much of the allure of the statistical approach to transfer in 
machine translation is the ability of that approach to for- 
mally cope with the problem of lexical ambiguity. Unfortu- 
nately, statistical methods are only able to mount a success- 
ful attack on this problem when the key to disambiguating 
the translation of a word falls within the local purview of the 
models used in transfer. 

Consider, for example, the French word prendre.
Although prendre is most commonly translated as to take, it
has a number of other less common translations. A trigram
model of English can be used to translate Je vais prendre la
décision as I will make the decision because the trigram
make the decision is much more common than the trigram
take the decision. However, a trigram model will not be of
much use in translating Je vais prendre ma propre décision
as I will make my own decision because in this case take and
decision no longer fall within a single trigram.
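
The limitation of the trigram window can be checked mechanically. The short sketch below lists the trigrams of each English sentence and tests whether the verb and the noun ever fall inside the same one; the helper names are illustrative.

```python
def trigrams(words):
    """All consecutive three-word windows of a sentence."""
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]


def share_trigram(words, a, b):
    """True if words a and b occur together inside some trigram."""
    return any(a in t and b in t for t in trigrams(words))


short = "I will make the decision".split()
long_ = "I will make my own decision".split()
print(share_trigram(short, "make", "decision"))   # verb and noun in one window
print(share_trigram(long_, "make", "decision"))   # separated by "my own"
```

In the second sentence the verb and decision are three positions apart, so no trigram probability can relate them directly, which is the motivation for encoding sense information locally.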

In the paper "Word Sense Disambiguation using Statistical
Methods" in the Proceedings of the 29th Annual
Meeting of the Association for Computational Linguistics,
published in June of 1991 by the Association for Computational
Linguistics and incorporated by reference herein, a
description is provided of a method of asking a question
about the context in which a word appears to assign that
word a sense. The question is constructed to have high
mutual information with the translation of that word in its
context. By modifying the lexical entries that appear in a
sequence of morphemes to reflect the senses assigned to
these entries, informative global information can be encoded
locally and thereby made available to the statistical models
used in transfer.






Although the method described in the aforementioned
paper assigns senses to words, the same method applies
equally well to the problem of assigning senses to morphemes,
and is used here in that fashion. This transducer
1106, for example, maps prendre to prendre_1 in the sentence

Je vais prendre ma propre voiture.

but to prendre_2 in the sentence

Je vais prendre ma propre décision.

It should be understood that other embodiments of the 
sense-labelling transducer are possible. For example, the 
sense-labelling can be performed by asking not just a single 
question about the context but a sequence of questions 
arranged in a decision tree. 

3.3 SOURCE-TRANSDUCERS WITH CONSTRAINTS 

In some embodiments, such as that depicted in FIG. 9, a
source-structure transducer, such as that labelled 901,
accepts a set of constraints that restricts its transformation
of source text into an intermediate source structure.

Such constraints include, but are not limited to,

requiring that a particular phrase be translated as a certain
linguistic component of a sentence, such as a noun phrase;

requiring that a source word be labelled as a certain
part-of-speech such as a verb or determiner;

requiring that a source word be morphologically analyzed
a certain way;

requiring that a source word be annotated with a particular
sense label;

in embodiments in which the intermediate structure
encodes parse-tree or case-frame information, requiring
a certain parse or case-frame structure for a sentence.

A source-transducer accepting such constraints is similar
to the source transducers described in this section. Based on
the descriptions already given, such transducers can be
constructed by a person with a computer science background
and skilled in the art.



4 FINITE-STATE TRANSDUCERS 

This section provides a description of an embodiment of
a mechanism by which the syntactic transductions in step
1104 and the morphological transductions in step 1105 are
performed. The mechanism is described in the context of a
particular example depicted in FIG. 15. One with a background
in computer science and skilled in the art of producing
finite-state transducers can understand from this example
how to construct syntactic and morphological transducers of
the type described above.

The example transducer inverts questions involving do,
does, and did. After steps 1101, 1102, and 1103, the source
text Why don't you ever succeed? is transduced into parallel
word and part-of-speech sequences 1501:

why   do    not   you   ever   succeed   ?
RRQ   VD0   XX    PPY   RR     VV0       ?

Here, RRQ and RR are adverb tags, VD0 and VV0 are
verb tags, XX is a special tag for the word not, PPY is a
pronoun tag, and ? is a special tag for a question mark.






The example transducer converts these two parallel
sequences to the parallel word and part-of-speech sequences
1502:

why   you   succeed   do_not_M0   ever_M1   QINV
RRQ   PPY   VV0       XX          RR        QINV



Here, QINV is a marker which records the fact that the 
original input sentence was question inverted. 

A mechanism by which a transducer achieves this trans- 
formation is depicted in FIG. 16, and is comprised of four 
components: 

a finite-state pattern matcher 1601; 

an action processing module 1603; 

a table of patterns 1602; 

a table of action specifications 1503. 

The transducer operates in the following way: 

1. One or more parallel input sequences 1605 are captured
by the finite-state pattern-matcher 1601;

2. The finite-state pattern-matcher compares the input
sequences against a table of patterns 1602 of input
sequences stored in memory;

3. A particular pattern is identified, and an associated
action-code 1606 is transmitted to the action-processing
module 1603;

4. The action-processing module obtains a specification
of the transformation associated with this action code
from a table of actions 1503 stored in memory;

5. The action-processing module applies the transformation
to the parallel input streams to produce one or more
parallel output sequences 1604.
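
The five steps above can be mimicked with a table of patterns and a table of actions. The sketch below is a toy illustration under several assumptions: the input is taken to be the seven word and tag pairs for Why don't you ever succeed? (why do not you ever succeed ?), the verb tag is written VV0, the single pattern is an ordinary Python regular expression over the tag sequence, and the action code DO_QUESTION and all function names are hypothetical.

```python
import re

# Table of patterns: (regular expression over the tag sequence, action code).
PATTERNS = [
    (re.compile(r"RRQ VD0 XX PPY RR VV0 \?"), "DO_QUESTION"),
]


def invert_do_question(tuples):
    """Action: undo question inversion for one fixed seven-position shape."""
    wh, do, neg, subj, adv, verb, _qmark = tuples
    return [wh, subj, verb,
            (f"{do[0]}_{neg[0]}_M0", neg[1]),
            (f"{adv[0]}_M1", adv[1]),
            ("QINV", "QINV")]


# Table of actions: action code -> transformation.
ACTIONS = {"DO_QUESTION": invert_do_question}


def transduce(tuples):
    """Match the tag sequence against the patterns and apply the action."""
    tags = " ".join(tag for _word, tag in tuples)
    for pattern, code in PATTERNS:
        if pattern.fullmatch(tags):
            return ACTIONS[code](tuples)
    return list(tuples)                 # no pattern matched: pass through


SEQ_IN = [("why", "RRQ"), ("do", "VD0"), ("not", "XX"), ("you", "PPY"),
          ("ever", "RR"), ("succeed", "VV0"), ("?", "?")]
for word, tag in transduce(SEQ_IN):
    print(word, tag)
```

Keeping patterns and actions in separate tables, as the text describes, means new transformations can be added by extending the tables without changing the matching loop.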

The parallel input streams captured by the finite-state 
pattern matcher 1601 are arranged in a sequence of attribute 
tuples. An example of such a sequence is the input sequence 
1501 depicted in FIG. 15. This sequence consists of a 
sequence of positions together with a set of one or more 
attributes which take values at the positions. A few examples 
of such attributes are the token attribute, the word attribute, 
the case-label attribute, the part-of-speech attribute, the 
sense-label attribute. The array of attributes for a given 
position will be called an attribute tuple. 

For example, the input attribute tuple sequence 1501 in
FIG. 15 is seven positions long and is made up of two-dimensional
attribute tuples. The first component of an
attribute tuple at a given position refers to the word attribute.
This attribute specifies the spellings of the words at given
positions. For example, the first word in the sequence 1501
is why. The second component of an attribute tuple at a
given position for this input sequence refers to a part-of-speech
tag for that position. For example, the part of speech
at the first position is RRQ. The attribute tuple at position 1
is thus the ordered pair (why, RRQ).

The parallel output streams produced by the action pro- 
cessing module 1603 are also arranged as a sequence of 
attribute tuples. The number of positions in an output 
sequence may be different from the number of positions in 
an input sequence. 

For example, the output sequence 1502 in FIG. 15
consists of six positions. Associated with each position is a
two-dimensional attribute tuple, the first coordinate of which
is a word attribute and the second coordinate of which is a
part-of-speech attribute.

An example of a table of patterns 1602 is shown in FIG.
17. This table is logically divided into a number of parts or
blocks.




Pattern-Action Blocks. The basic definitions of the
matches to be made and the actions to be taken are
contained in pattern-action blocks. A pattern-action
block comprises a list of patterns together with the
names of actions to be invoked when patterns in an input
attribute-tuple sequence 1605 are matched.
Auxiliary Pattern Blocks. Patterns that can be used as
sub-patterns in the patterns of pattern-action blocks are
defined in Auxiliary Pattern blocks. Such blocks contain
lists of labelled patterns of attribute tuples.
These labelled patterns do not have associated actions,
but can be referenced by their name in the definitions
of other patterns.
In FIG. 17 there is one Auxiliary Pattern block. This
block defines four auxiliary patterns. The first of these
has a name ADVERB and matches single tuple adverb-type
constructions. The second has a name of BARE
NP and matches certain noun-phrase-type constructions.
Notice that this auxiliary pattern makes use of the
ADVERB pattern in its definition. The third and fourth
auxiliary patterns match other types of noun phrases.
Set Blocks. Primary and auxiliary patterns allow for sets
of attributes. In FIG. 17, for example, there is a set
called DO SET, of various forms of to do, and another
set PROPER NOUN TAG of proper-noun tags.
Patterns are defined in terms of regular expressions of
attribute tuples. Any pattern of attribute tuples that can be
recognized by a deterministic finite-state automaton can be
specified by a regular expression. The language of regular
expressions and methods for constructing finite-state
automata are well known to those skilled in computer
science. A method for constructing a finite-state pattern
matcher from a set of regular expressions is described in the
article "LEX - A Lexical Analyzer Generator" written by
Michael E. Lesk, and appearing in the Bell Systems Technical
Journal, Computer Science Technical Report Number 39,
published in October of 1975.

Regular expressions accepted by the pattern matcher 1601
are described below.
Regular Expressions of Attribute Tuples: A regular
expression of attribute tuples is a sequence whose
elements are either

1. an attribute tuple;

2. the name of an auxiliary regular expression; or

3. a register name.

These elements can be combined using one of the logical
operations:



Operator   Meaning            Usage   Matches

.          concatenation      A.B     A followed by B
|          union (i.e. or)    A|B     A or B
*          0 or more          A*      0 or more A's
?          0 or 1             A?      0 or 1 A's
+          1 or more          A+      1 or more A's



Here A and B denote other regular expressions.
Examples of these logical operations are:

Expression   Matches

A?.B.C       0 or 1 A's followed by B then by C
(A*)|(B+)    0 or more A's or 1 or more B's
(A|B).C      A or B, followed by C

Attribute Tuples: The most common type of element in a
regular expression is an attribute tuple. An attribute
tuple is a vector whose components are either

1. an attribute (as identified by its spelling);

2. a name of a set of attributes; or

3. the wild card attribute.

These elements are combined using the following operators:

Operator   Meaning                                   Usage

,          Delimiter between coordinates             a,b
           of an attribute tuple
^          Negation                                  ^a
#          Wild Card                                 #

(Here a and b denote attribute spellings or attribute set
names.)

The meanings of these operators are best illustrated by
example. Let a, b, and c denote either attribute spellings
or set names. Assume the dimension of the attribute
tuples is 3. Then:

Attribute
Tuple      Matches

a,b,c      First attribute matches a, second matches b,
           third matches c
,b,c       First attribute elided (matches anything),
           second attribute matches b, third matches c
,b,        First and third attributes elided (match anything),
           second attribute matches b
a          Second and third attributes elided (match anything),
           first matches a
#,b,       First attribute wild-card (i.e. matches anything),
           second attribute matches b, third attribute elided
a,^b,^c    Second attribute matches anything EXCEPT b, third
           matches anything EXCEPT c

Auxiliary Regular Expressions: A second type of element
in a regular expression is an auxiliary regular expression.
An auxiliary regular expression is a labelled
regular expression which is used as a component of a
larger regular expression.

Logically, a regular expression involving auxiliary regular
expressions is equivalent to the regular expression
obtained by resolving the reference to the auxiliary
pattern. For example, suppose an auxiliary regular
expression named D has been defined by:

D = A.B+.A*

where A, B denote attribute tuples (or other auxiliary
patterns). Then:

Expression   is equivalent to

C.D          C.A.B+.A*
D+.C.D       (A.B+.A*)+.C.A.B+.A*
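The matching rules for a single attribute tuple can be sketched in a few lines of Python. This is only an illustrative sketch, not the pattern matcher 1601 itself: the function name matches_tuple, the set name DO_SET and its members, and the tag VDZ are hypothetical, and the use of "^" as the negation prefix is an assumption of the sketch. The wild card "#", elided coordinates, and set names follow the description above.

```python
WILD_CARD = "#"   # wild card attribute, as described in the text
NEGATION = "^"    # assumed spelling of the negation operator

def matches_tuple(pattern, attributes, sets=None):
    """Return True if an attribute-tuple pattern matches one input tuple.

    pattern    -- string such as ",b," or "a,^b,^c"
    attributes -- tuple of attribute spellings at one input position
    sets       -- optional mapping from set names to sets of spellings
    """
    sets = sets or {}
    components = pattern.split(",")
    # Trailing coordinates may be omitted; treat them as elided.
    components += [""] * (len(attributes) - len(components))
    for component, value in zip(components, attributes):
        if component in ("", WILD_CARD):
            continue  # elided or wild card: matches anything
        negated = component.startswith(NEGATION)
        name = component[1:] if negated else component
        # A name is either a set name or a literal attribute spelling.
        hit = value in sets[name] if name in sets else value == name
        if hit == negated:
            return False
    return True

# Hypothetical set of forms of "to do", for illustration only.
example_sets = {"DO_SET": {"do", "does", "did"}}
print(matches_tuple("DO_SET,", ("does", "VDZ"), example_sets))
print(matches_tuple("a,^b,^c", ("a", "x", "y")))
```

A full matcher would compile sequences of such tuples, combined with the concatenation, union and repetition operators, into a finite-state automaton; the sketch covers only the per-position test.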



Registers: Just knowing that a regular expression matches 
an input attribute tuple sequence usually does not 
provide enough information for the construction of an 
appropriate output attribute tuple sequence. Data is 
usually also required about the attribute tuples matched 
by different elements of the regular expression. In 
ordinary LEX, to extract this type of information often 
requires the matched input sequence to be parsed again. 
To avoid this cumbersome approach, the pattern- 
matcher 1601 makes details about the positions in the 
input stream of the matched elements of the regular 
expression more directly available. From these posi- 
tions, the identities of the attribute tuples can then be 
determined. 

Positional information is made available through the use
of registers. A register in a regular expression does not
match any input. Rather,

1. After a match, a register is set equal to the position
in the input sequence of the next tuple in the input
sequence that is matched by an element of the
regular expression to the right of the register.

2. If no further part of the regular expression to the right
of the register matches, then the register is set equal
to zero.

The operation of registers is best illustrated by some
examples. These examples use registers [1] and [2]:



Expression         Contents of Registers after match

A.[1].B.C          Reg 1: First position of B match
A.[2].(C|D)        Reg 2: First position of either C or D match
A.[1].B*.[2].C     Reg 1: If B matches: First position of B match
                          Otherwise: First position of C match
                   Reg 2: First position of C match
A.[1].B*.[2].C*    Reg 1: If B matches: First position of B match
                          If C matches: First position of C match
                          Otherwise: 0
                   Reg 2: If C matches: First position of C match
                          Otherwise: 0

A pattern-action block defines a pattern matcher. When an
input attribute-tuple sequence is presented to the finite-state
pattern matcher, a current input position counter is initialized
to 1, denoting that the current input position is the first
position of the sequence. A match at the current input
position is attempted for each pattern. If no pattern matches,
an error occurs. If more than one pattern matches, the match
of the longest length is selected. If several patterns match with
the same longest length, the one appearing first in the
definition of the pattern-action block is selected. The action
code associated with that pattern is then transmitted to the
action processing module 1603.

Transformations by the action processing module are
defined in a table of actions 1503 which is stored in memory.
The actions can be specified in any one of a
number of programming languages such as C, PASCAL,
FORTRAN, or the like.

In the question-inversion example, the action specified in
the pseudo-code in FIG. 18 is invoked when the pattern
defined by the regular expression in lines 3-4 is matched.
This action inverts the order of the words in certain questions
involving forms of do. An instance of this action is
shown in FIG. 15. In the pseudo-code of FIG. 18 for this
action, the symbol @reg(i) denotes the contents of register
i. In line 6 of this pseudo-code, the output attribute tuple
sequence is set to null.

A question matched by the regular expression in lines 3-4
may or may not begin with a (so-called) wh- word in the set
WH NP. If it does match, the appropriate action is to append
the input tuple in the first position to the output sequence.
This is done in lines 8-9.

After the wh- word, the next words of the output sequence
should be the subject noun phrase of the input sequence.
This is done in lines 11-12, which append all tuples
matching the regular expression SUBJECT NP to the output
sequence.

For negative questions involving forms of do, the
part-of-speech tags of the output verb and of the output adverbs
are the same as those of the input verb and adverbs. Thus the
entire input tuple sequences corresponding to these words
can be appended to the output. This is done in lines 15-18.

For positive questions the tag attribute of the output verb
may be different from that of the input verb. This is handled
in lines 25-37. The input word attribute for the verb is
appended to the output word attribute in lines 26, 31 and
35. The output tag attribute is selected based on the form of
do in the input sequence. Explicit tag values are appended to
the output sequence in lines 32 and 37.

The remaining input words and tags other than the question
mark are written to the output sequence in lines 43-44.
The output sequence is completed in line 46 by the marker
QINV signalling question inversion, together with the
appropriate tag.



5 TARGET TRANSDUCERS 

In this section, some embodiments of the target-structure
transducer 704 will be explained. The role of this
transducer is to produce portions of text in the target
language from one or more intermediate target-structures.

5.1 OVERVIEW 

An embodiment of a target-structure transducer 704 for 
English is shown in FIG. 19. In this embodiment, the 
transducer 704 converts an intermediate target-structure for 
English consisting of a sequence of linguistic morphs into a 
single English sentence. This embodiment performs the 
inverse transformations of those performed by the source 
transducer 702 depicted in FIG. 13. That is, if the transducer
702 is applied to an English sentence, and then the transducer
704 is applied to the sequence of linguistic morphs
produced by the transducer 702, the original English sentence
is recovered.

This embodiment of the target-transducer 704 comprises 
a sequence of transducers which: 

generate nouns, adverbs and adjectives from morphologi- 
cal units 1901; 

provide do-support for negatives 1902; 

separate compound verb tenses 1903; 

provide do-support for inverted questions 1904; 

conjugate verbs 1905; 

reposition adverbs and adjectives 1906; 

restore question inversion 1907; 

resolve ambiguities in the choice of target words resulting 
from ambiguities in the morphology 1908; and 

case the target sentence 1909. 






It should be understood that FIG. 19 represents only one 
possible embodiment of the target-transducer 704. Many 
variations are possible. For example, in other embodiments, 
rather than a single target sentence being produced for each 
intermediate target-structure, a set of several target sen- 
tences together with probabilities or scores may be pro- 
duced. In such embodiments, the transducers depicted in 
FIG. 19 can be replaced by transducers which produce at 
each stage several target sentences with probabilities or 
scores. Moreover, in embodiments of the present invention 
in which the intermediate target-structures are more sophis- 
ticated than lexical morph sequences, the target-structure 
transducer is also more involved. For example, if the inter- 
mediate target-structure consists of a parse tree of a sentence 
or case frames for a sentence, then the target-structure 
transducer converts these to the target language. 

5.2 COMPONENTS 

The transducers depicted in FIG. 19 will now be 
explained. The first seven of these transducers 1901 1902 
1903 1904 1905 1906 1907 and the last transducer 1909 are 
implemented using a mechanism similar to that depicted in 
FIG. 12 and described in subsection Finite-State Transduc- 
ers. This is the same mechanism that is used by the syntactic 
transducers 1104 and the morphological transducers 1105 
that are part of the source-transducer 701. Such transducers
can be constructed by a person with a background in
computer science and skilled in the art. The construction of
the transducer 1908 will be explained below.

The first transducer 1901 begins the process of combining
lexical morphs into words. It generates nouns, adverbs and
adjectives from their morphological constituents. For
example, this transducer performs the conversions:

    the ADJ_er big boy N_plural V_bare to_eat rapid ADV_ly.
    => the bigger boys V_bare to_eat rapidly.

    he V_present_3s to_eat the pea N_PLURAL quick ADV_er than I.
    => he V_present_3s to_eat the peas more quickly than I.

(The final sentences here are The bigger boys eat rapidly
and He eats the peas more quickly than I.)

There are sometimes ambiguities in generating adverbs
and adjectives from their morphological units. For example,
the lexical morph sequence quick ADV_ly can be converted
to either quick or quickly, as for example in either how quick
can you run? or how quickly can you run?. This ambiguity
is encoded by the transducer 1901 by combining the different
possibilities into a single unit. Thus, for example, the
transducer 1901 performs the conversion:

    how can quick ADV_ly M1 you V_bare to_run ?
    => how can quick_quickly M1 you V_bare to_run ?

The second transducer 1902 provides do-support for
negatives in English by undoing the do-not coalescences in
the lexical morph sequence. Thus, for example, it converts:

    John V_present_3s to_eat do_not M1 really M1 a lot.
    => John V_present_3s to_do eat not M2 really M2 a lot.

by splitting do from not and inserting to do at the head of
the verb phrase. (The final sentence here is John does not
really eat a lot.) The move-markers M1, M2, etc., measure
the position of the adverb (or adjective) from the head of the
verb phrase. These markers are correctly updated by the
transformation.

The next transducer 1903 provides do-support for
inverted questions by inserting the appropriate missing form
of to do into lexical morph sequences corresponding to
inverted questions. This is typically inserted between the
verb tense marker and the verb. Thus, for example, it
converts:

    John V_present_3s to_like to_eat QINV
    => John V_present_3s to_do like to_eat QINV

The next transducer 1904 prepares the stage for verb
conjugation by breaking up complicated verb tenses into
simpler pieces.

    John V_has_been_ing to_eat a lot recently
    => John V_header has been V_ing to_eat a lot recently.

This transducer also inserts a verb phrase header marker
V_header for use by later transducers. In particular the verb
phrase header marker is used by the transducer 1907 that
repositions adjectives and adverbs.

The next transducer 1905 conjugates verbs by combining
the root of the verb and the tense of the verb into a single
word. This process is facilitated by the fact that complicated
verb tenses have already been broken up into simpler pieces.
For example, the previous sentence is transduced to:

    => John V_header has been eating a lot recently.

The next transducer 1906 repositions adverbs and adjectives
to their appropriate places in the target sentence. This
is done using the move-markers M1, M2, etc., that are
attached to adjectives. Thus, for example, this transducer
converts:

    Iraq V_header will be Balkanized probably M2 not M2.
    => Iraq V_header will probably not be Balkanized

The verb phrase header marker V_header is kept for use by
the next transducer 1907.

The transducer 1907 restores the original order of the
noun and verb in inverted questions.

These questions are identified by the question inversion
marker QINV. In most cases, to restore the inverted question
order, the modal or verb at the head of the verb phrase is
moved to the beginning of the sentence:

    John V_header was here yesterday QINV
    => was John here yesterday?



If the sentence begins with one of the words what, who,
where, why, whom, how, when, which, the modal or verb is
moved to the second position in the sentence:

    where John V_header was yesterday QINV
    => where was John yesterday?

A special provision is made for moving the word not in
inverted questions. The need to do this is signaled by a
special marker M0 in the lexical morph sentence. The not
preceding the M0 marker is moved to the position following
the modal or verb:

    he V_header can swim not M0 QINV
    => can not he swim?

    John V_header does eat not M0 potatoes?
    => Does not John eat potatoes?

After the inverted question order has been restored, the
question inversion marker QINV is removed and replaced
by a question mark at the end of the sentence.
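The reordering rules just described can be illustrated with a short Python sketch. It is a simplified illustration under stated assumptions rather than the finite-state implementation referred to in this section: the function name restore_question_inversion is hypothetical, tokens are assumed to be already split into a list, the markers are assumed to be spelled V_header, QINV and M0, and casing of the result is left to a later stage.

```python
WH_WORDS = {"what", "who", "where", "why", "whom", "how", "when", "which"}

def restore_question_inversion(tokens):
    """Restore inverted question order in a list of morph tokens."""
    if "QINV" not in tokens or "V_header" not in tokens:
        return list(tokens)  # not an inverted question: leave unchanged
    toks = [t for t in tokens if t != "QINV"]
    head = toks.index("V_header")
    moved = [toks[head + 1]]              # modal or verb heading the verb phrase
    rest = toks[:head] + toks[head + 2:]  # drop V_header and the moved word
    if "M0" in rest:
        m = rest.index("M0")
        if m > 0 and rest[m - 1] == "not":
            del rest[m - 1:m + 1]         # remove "not M0" from its old place
            moved.append("not")           # "not" follows the modal or verb
    # After a leading wh-word the verb goes to second position, else first.
    at = 1 if rest and rest[0].lower() in WH_WORDS else 0
    return rest[:at] + moved + rest[at:] + ["?"]

print(" ".join(restore_question_inversion(
    "where John V_header was yesterday QINV".split())))
```

The sketch assumes that a word always follows V_header; a production transducer would also have to handle malformed input.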

The next transducer 1909 resolves ambiguities in the
choice of words for the target sentence arising from
ambiguities in the process 1901 of generating adverbs and
adjectives from their morphological units.

For example, the transducer 1909 converts the sentence

    how quick_quickly can you run?

which contains the ambiguity in resolving quick ADV_ly into
a single adverb, into the sentence

    how quickly can you run?

To perform such conversions, the transducer 1909 uses a
target-language model to assign a probability or score to
each of the different possible sentences corresponding to an
input sentence with a morphological ambiguity. The sentence
with the highest probability or score is selected. In the
above example, the sentence how quickly can you run? is
selected because it has a higher target-language model
probability or score than how quick can you run? In some
embodiments of the transducer 1909 the target-language
model is a bigram model similar to the target-structure
language model 706. Such a transducer can be constructed
by a person skilled in the art.

The last transducer 1910
assigns a case to the words of a target sentence based on the
casing rules for English. Principally this involves capitalizing
the words at the beginning of sentences. Such a transducer
can easily be constructed by a person skilled in the art.
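The selection of the highest-scoring candidate by a target-language model can be sketched as follows. This is a minimal sketch under assumptions: an ambiguous unit is represented here as a Python tuple of alternatives rather than as a joined token, the function names are hypothetical, and the bigram log-probabilities are made-up illustrative values, not figures from any trained model.

```python
from itertools import product

def candidate_sentences(units):
    """Expand every ambiguous unit (a tuple of alternatives) into all readings."""
    choices = [u if isinstance(u, tuple) else (u,) for u in units]
    return [list(words) for words in product(*choices)]

def bigram_score(sentence, bigram_logprob, floor=-10.0):
    """Sum of bigram log-probabilities; unseen bigrams get a floor value."""
    words = ["<s>"] + sentence
    return sum(bigram_logprob.get(pair, floor) for pair in zip(words, words[1:]))

def resolve_ambiguity(units, bigram_logprob):
    """Return the candidate sentence with the highest language-model score."""
    return max(candidate_sentences(units),
               key=lambda s: bigram_score(s, bigram_logprob))

# Illustrative log-probabilities only (assumed values).
toy_model = {
    ("how", "quickly"): -2.0, ("quickly", "can"): -3.0,
    ("how", "quick"): -5.0, ("quick", "can"): -6.0,
}
units = ["how", ("quick", "quickly"), "can", "you", "run", "?"]
print(" ".join(resolve_ambiguity(units, toy_model)))
```

Enumerating all readings is exponential in the number of ambiguous units; for sentences with many ambiguities a dynamic-programming search over the bigram model would be the more economical design.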

5.3 TARGET TRANSDUCERS WITH CONSTRAINTS 

In some embodiments, such as that depicted in FIG. 9, a
target-structure transducer, such as that labelled 904, accepts
a set of constraints that restricts its transformations of an
intermediate target structure to target text.

Such constraints include, but are not limited to, requiring
that a particular target word appear in the final target text.
For example, because of ambiguities in a morphology, a
target transducer may not be able to distinguish between
electric and electrical. Thus the output of a target transducer
might include the phrase electrical blanket in a situation
where electric blanket is preferred. A constraint such as The
word "electric" must appear in the target sentence would
correct this shortcoming.

A target-transducer accepting such constraints is similar
to target transducers as described in this section. Based on
the descriptions already given, such transducers can be
constructed by a person with a computer science background
and skilled in the art.

6 TARGET LANGUAGE MODEL 

The inventions described in this specification employ 
probabilistic models of the target language in a number of 
places. These include the target structure language model 
705, and the class language model used by the decoder 404. 
As depicted in FIG. 20, the role of a language model is to 
compute an a priori probability or score of a target structure. 

Language models are well known in the speech recognition
art. They are described in the article "Self-Organized
Language Modeling for Speech Recognition", by F. Jelinek,
appearing in the book Readings in Speech Recognition
edited by A. Waibel and K. F. Lee and published by Morgan
Kaufmann Publishers, Inc., San Mateo, Calif., in 1990. They
are also described in the article "A Tree-Based Statistical
Model for Natural Language Speech Recognition", by L.
Bahl, et al., appearing in the July 1989 Volume 37 of the
IEEE Transactions on Acoustics, Speech and Signal Processing.
These articles are included by reference herein.
They are further described in the paper "Trainable Grammars
for Speech Recognition", by J. Baker, appearing in the
1979 Proceedings of the Spring Conference of the Acoustical
Society of America.

In some embodiments of the present inventions, the target
structure consists of a sequence of morphs. In these
embodiments, n-gram language models, as described in the
aforementioned article by F. Jelinek, can be used. In other
embodiments, the target structure comprises parse trees of
the target language. In these embodiments, language models
based on stochastic context-free grammars, as described in
the aforementioned article by F. Jelinek and the aforementioned
paper by J. Baker, can be used.

In addition, decision tree language models, as described in
the aforementioned paper by L. Bahl, et al., can be adapted
by one skilled in the art to model a wide variety of target
structures.

6.1 PERPLEXITY 

The performance of a language model in a complete 
system depends on a delicate interplay between the language 
model and other components of the system. One language 
model may surpass another as part of a speech recognition 
system but perform less well in a translation system. 

Since it is expensive to evaluate a language model in the 
context of a complete system, it is useful to have an intrinsic 
measure of the quality of a language model. One such 
measure is the probability that the model assigns to the large 
sample of target structures. One judges as better the lan- 
guage model which yields the greater probability. When the 
target structure is a sequence of words or morphs, this 
measure can be adjusted so that it takes account of the length 
of the structures. This leads to the notion of the perplexity of 
a language model with respect to a sample of text S: 



$$\mathrm{perplexity} \equiv \Pr(S)^{-1/|S|} \qquad (3)$$

where |S| is the number of morphs of S. Roughly speaking,
the perplexity is the average number of morphs which the
model cannot distinguish between, in predicting a morph of
S. The language model with the smaller perplexity will be
the one which assigns the larger probability to S.
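As a small worked illustration of this length-normalized measure, the following Python sketch computes the perplexity from the log-probability of a sample, which avoids numerical underflow for long samples. The function name and the toy numbers are assumptions made for the example.

```python
import math

def perplexity(log_prob_of_sample, num_morphs):
    """Perplexity Pr(S)^(-1/|S|), computed from the natural log of Pr(S)."""
    return math.exp(-log_prob_of_sample / num_morphs)

# Toy check: a model that spreads probability uniformly over 8 morphs
# assigns probability (1/8)^5 to any sample of 5 morphs.
sample_log_prob = 5 * math.log(1 / 8)
print(perplexity(sample_log_prob, 5))
```

In this toy case the perplexity equals the number of equally likely morphs, which matches the reading of perplexity as the average number of morphs the model cannot distinguish between.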

Because perplexity depends not only on the language 
model but also on the sample of text, it is important that the 
text be representative of that for which the language model 
is intended. Because perplexity is subject to sampling error, 
making fine distinctions between language models may 
require that the perplexity be measured with respect to a 
large sample. 

6.2 n-GRAM LANGUAGE MODELS 

n-gram language models will now be described. For these 
models, the target structure consists of a sequence of mor- 
phs. 

Let $m_1 m_2 m_3 \ldots m_k$ be a sequence of k morphs $m_i$.
For $1 \le i \le j \le k$, let $m_i^j$ denote the subsequence
$m_i m_{i+1} \ldots m_j$. For any sequence, the probability of
$m_1^k$ is equal to the product of the conditional probabilities
of each morph $m_i$ given the previous morphs $m_1^{i-1}$:

$$\Pr(m_1^k) = \Pr(m_1)\,\Pr(m_2 \mid m_1)\,\Pr(m_3 \mid m_1 m_2) \cdots \Pr(m_k \mid m_1^{k-1}). \qquad (4)$$

The sequence $m_1^{i-1}$ is called the history of the morph $m_i$
in the sequence.

For an n-gram model, the conditional probability of a
morph in a sequence is assumed to depend on its history only
through the previous n-1 morphs:

$$\Pr(m_i \mid m_1^{i-1}) = \Pr(m_i \mid m_{i-n+1}^{i-1}). \qquad (5)$$

For a vocabulary of size V, a 1-gram model is determined
by V-1 independent numbers, one probability $\Pr(m)$ for
each morph m in the vocabulary, minus one for the constraint
that all of the probabilities add up to 1. A 2-gram
model is determined by $V^2-1$ independent numbers, $V(V-1)$
conditional probabilities of the form $\Pr(m_2 \mid m_1)$ and V-1 of
the form $\Pr(m)$. In general, an n-gram model is determined
by $V^n-1$ independent numbers, $V^{n-1}(V-1)$ conditional
probabilities of the form $\Pr(m_n \mid m_1^{n-1})$, called the order-n
conditional probabilities, plus $V^{n-1}-1$ numbers which
determine an (n-1)-gram model.

The order-n conditional probabilities of an n-gram model
form the transition matrix of an associated Markov model.
The states of this Markov model are sequences of n-1
morphs, and the probability of a transition from the state
$m_1 m_2 \ldots m_{n-1}$ to the state $m_2 m_3 \ldots m_n$ is
$\Pr(m_n \mid m_1 m_2 \ldots m_{n-1})$. An n-gram language model is
called consistent if, for each string $m_1^{n-1}$, the probability
that the model assigns to $m_1^{n-1}$ is the steady state
probability for the state $m_1^{n-1}$ of the associated Markov model.
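The chain-rule factorization of a sequence probability, with each history truncated to the previous n-1 morphs, can be sketched in Python as follows. The function name sequence_logprob and the callback cond_prob(morph, history) are hypothetical; any n-gram model that returns a conditional probability for a morph given a history tuple could be plugged in.

```python
import math

def sequence_logprob(morphs, cond_prob, n):
    """Log-probability of a morph sequence under an n-gram model.

    cond_prob(m, history) must return Pr(m | history), where history is a
    tuple holding at most the n-1 morphs that precede m.
    """
    total = 0.0
    for i, m in enumerate(morphs):
        # Keep only the previous n-1 morphs; early positions have less.
        history = tuple(morphs[max(0, i - (n - 1)):i])
        total += math.log(cond_prob(m, history))
    return total

# Toy model (assumed): every morph has probability 1/2 regardless of history.
print(sequence_logprob(["a", "b", "c"], lambda m, h: 0.5, n=2))
```

Working in log-probabilities is a design choice of the sketch: the product of many small probabilities underflows quickly, whereas the sum of their logarithms does not.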
6.3 SIMPLE n-GRAM MODELS 
The simplest form of an n-gram model is obtained by 
assuming that all the independent conditional probabilities 
are independent parameters. For such a model, values for the 
parameters can be determined from a large sample of 
training text by sequential maximum likelihood training. 
The order-n probabilities are given by

$$\Pr(m_n \mid m_1^{n-1}) = \frac{f(m_1^{n-1} m_n)}{\sum_m f(m_1^{n-1} m)}, \qquad (6)$$

where $f(m_1^i)$ is the number of times the string of morphs $m_1^i$
appears in the training text. The remaining parameters are
determined inductively by an analogous formula applied to
the corresponding (n-1)-gram model. Sequential maximum
likelihood training does not produce a consistent model,
although for a large amount of training text, it produces a
model that is very nearly consistent.
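For the 2-gram case, relative-frequency estimation of the order-2 conditional probabilities from counts can be sketched as below. This is a minimal illustration, assuming the training text is already a list of morphs; the function name train_bigram and the toy text are hypothetical, and the lower-order (1-gram) parameters would be estimated by the analogous count ratio.

```python
from collections import Counter

def train_bigram(morphs):
    """Estimate Pr(m2 | m1) as count(m1 m2) / sum over m of count(m1 m)."""
    pair_counts = Counter(zip(morphs, morphs[1:]))
    history_totals = Counter()
    for (history, _), count in pair_counts.items():
        history_totals[history] += count
    return {
        (history, m): count / history_totals[history]
        for (history, m), count in pair_counts.items()
    }

probs = train_bigram("a b a b a c".split())
print(probs[("a", "b")], probs[("a", "c")], probs[("b", "a")])
```

Any 2-gram that never occurs in the training text simply has no entry here, that is, an estimated probability of zero, which is the weakness of the simple model on sparse data.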

Unfortunately, many of the parameters of a simple n-gram
model will not be reliably estimated by this method. The
problem is illustrated in Table 3, which shows the number of
1-, 2-, and 3-grams appearing with various frequencies in a
sample of 365,893,263 words of English text from a variety
of sources. The vocabulary consists of the 260,740 different
words plus a special unknown word into which all other
words are mapped. Of the 6.799 x 10^10 2-grams that might
have occurred in the data, only 14,494,217 actually did
occur and of these, 8,045,024 occurred only once each.
Similarly, of the 1.773 x 10^16 3-grams that might have
occurred, only 75,349,888 actually did occur and of these,
53,737,350 occurred only once each.

TABLE 3

Number of n-grams with various frequencies in
365,893,263 words of running text

Count    1-grams    2-grams          3-grams

1        36,789     8,045,024        53,737,350
2        20,269     2,065,469        9,229,958
3        13,123     970,434          3,653,791
>3       135,335    3,413,290        8,728,789
>0       205,516    14,494,217       75,349,888
>=0      260,741    6.799 x 10^10    1.773 x 10^16

These data and Turing's formula imply that 14.7 percent of
the 3-grams and 2.2 percent of the 2-grams in a new sample
of English text will not appear in the original
sample. Thus, although any 3-gram that does not appear in
the original sample is rare, there are so many of them that
their aggregate probability is substantial.
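The two percentages can be checked with a few lines of Python. Turing's formula estimates the total probability of events not seen in a sample as the number of events seen exactly once divided by the sample size. The sketch assumes that the number of 2-gram and 3-gram tokens in the sample is approximately the number of words, 365,893,263; the function name is hypothetical.

```python
def unseen_probability_mass(seen_once, sample_size):
    """Turing's estimate of the probability of a previously unseen event."""
    return seen_once / sample_size

SAMPLE_WORDS = 365_893_263  # also used as the approximate n-gram token count

three_gram_mass = unseen_probability_mass(53_737_350, SAMPLE_WORDS)
two_gram_mass = unseen_probability_mass(8_045_024, SAMPLE_WORDS)

print(round(100 * three_gram_mass, 1))  # 14.7
print(round(100 * two_gram_mass, 1))    # 2.2
```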

Thus, as n increases, the accuracy of a simple n-gram 
model increases, but the reliability of the estimates for its 
parameters decreases. 

6.4 SMOOTHING 

A solution to this difficulty is provided by interpolated
estimation, which is described in detail in the paper
"Interpolated estimation of Markov source parameters from sparse
data", by F. Jelinek and R. Mercer and appearing in
Proceedings of the Workshop on Pattern Recognition in Practice,
published by North-Holland, Amsterdam, The Netherlands,
in May 1980. Interpolated estimation combines several
models into a smoothed model which uses the probabilities
of the more accurate models where they are reliable and,
where they are unreliable, falls back on the more reliable
probabilities of less accurate models. If $\Pr_j(m_i \mid m_1^{i-1})$ is the
jth language model, the smoothed model, $\Pr(m_i \mid m_1^{i-1})$, is
given by

$$\Pr(m_i \mid m_1^{i-1}) = \sum_j \lambda_j(m_1^{i-1})\,\Pr_j(m_i \mid m_1^{i-1}). \qquad (7)$$

The values of the $\lambda_j(m_1^{i-1})$ are determined using the EM
method, so as to maximize the probability of some additional
sample of training text called held-out data. When
interpolated estimation is used to combine simple 1-, 2-, and
3-gram models, the $\lambda$'s can be chosen to depend on $m_1^{i-1}$
only through the count of $m_{i-2} m_{i-1}$. Where this count is
high, the simple 3-gram model will be reliable, and, where
this count is low, the simple 3-gram model will be unreliable.
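The following Python sketch illustrates an interpolated 3-gram estimate in which the weights depend on the history only through the count of the two preceding morphs. Everything concrete in it is an assumption for illustration: the component models return fixed toy values, the weights are hand-picked rather than trained by the EM method on held-out data, and only two count ranges are used.

```python
def interpolated_prob(m, history, models, lambdas_for):
    """Weighted sum of component model probabilities for morph m."""
    weights = lambdas_for(history)
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * model(m, history) for w, model in zip(weights, models))

# Toy component models (assumed constant values, for illustration only).
def unigram(m, history):
    return 0.1

def bigram(m, history):
    return 0.3

def trigram(m, history):
    return 0.6

# Hypothetical 2-gram counts from a training text.
BIGRAM_COUNTS = {("of", "the"): 5000, ("purple", "the"): 1}

def lambdas_for(history):
    """Choose weights from the count of the last two morphs of the history."""
    count = BIGRAM_COUNTS.get(tuple(history[-2:]), 0)
    # High count: trust the 3-gram model; low count: fall back to lower orders.
    return (0.1, 0.2, 0.7) if count >= 100 else (0.5, 0.3, 0.2)

models = [unigram, bigram, trigram]
print(interpolated_prob("end", ["of", "the"], models, lambdas_for))
print(interpolated_prob("end", ["purple", "the"], models, lambdas_for))
```

Tying the weights to count ranges rather than to individual histories keeps the number of weights small enough to be estimated reliably from held-out data.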

The inventors constructed an interpolated 3-gram model
in which the $\lambda$'s were divided into 1782 different sets
according to the 2-gram counts, and determined from a
held-out sample of 4,630,934 words. The power of
the model was tested using the 1,014,312 word Brown
corpus. This well known corpus, which contains a wide
variety of English text, is described in the book
Computational Analysis of Present-Day American English, by H.
Kucera and W. Francis, published by Brown University
Press, Providence, R.I., 1967. The Brown corpus was not
included in either the training or held-out data used to
construct the model. The perplexity of the interpolated
model with respect to the Brown corpus was 244.

6.5 n-GRAM CLASS MODELS 

Clearly, some words are similar to other words in their
meaning and syntactic function. For example, the probability
distribution of words in the vicinity of Thursday is very
much like that for words in the vicinity of Friday. Of course,
they will not be identical: people rarely say Thank God it's
Thursday! or worry about Thursday the 13th.

In class language models, morphs are grouped into
classes, and morphs in the same class are viewed as similar.
Suppose that $\mathcal{C}$ is a map that partitions the vocabulary of V
morphs into C classes by assigning each morph m to a class
$\mathcal{C}(m)$. An n-gram class model based on $\mathcal{C}$ is an n-gram
language model for which

$$\Pr(m_k \mid m_1^{k-1}) = \Pr(m_k \mid c_k)\,\Pr(c_k \mid c_1^{k-1}), \qquad (8)$$

where $c_i = \mathcal{C}(m_i)$. An n-gram class model is determined by
$C^n - 1 + V - C$ independent numbers: V-C of the form
$\Pr(m_i \mid c_i)$, plus $C^n - 1$ independent numbers which determine
an n-gram language model for a vocabulary of size C. If C
is much smaller than V, these are many fewer numbers than
are required to specify a general n-gram language model.

In a simple n-gram class model, the $C^n - 1 + V - C$ independent
probabilities are treated as independent parameters. For
such a model, values for the parameters can be determined
by sequential maximum likelihood training. The order-n
probabilities are given by

$$\Pr(c_n \mid c_1^{n-1}) = \frac{f(c_1^{n-1} c_n)}{\sum_c f(c_1^{n-1} c)}, \qquad (9)$$

where $f(c_1^i)$ is the number of times that the sequence of
classes $c_1^i$ appears in the training text. (More precisely, $f(c_1^i)$
is the number of distinct occurrences in the training text of
a consecutive sequence of morphs $m_1^i$ for which $c_k = \mathcal{C}(m_k)$
for $1 \le k \le i$.)
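For n = 2, the class-model factorization and the parameter count can be illustrated with the Python sketch below. The class assignment, the probability tables and the function names are hypothetical values chosen for the example, not estimates from any corpus.

```python
def class_bigram_prob(m, previous, class_of, p_morph_given_class, p_class_given_class):
    """Pr(m | previous) = Pr(m | c) * Pr(c | c_prev) for a 2-gram class model."""
    c, c_prev = class_of[m], class_of[previous]
    return p_morph_given_class[(m, c)] * p_class_given_class[(c, c_prev)]

def num_class_model_parameters(V, C, n):
    """Independent numbers in an n-gram class model: C^n - 1 + V - C."""
    return C ** n - 1 + V - C

# Toy class map and tables (assumed values, for illustration only).
class_of = {"Thursday": "DAY", "Friday": "DAY", "on": "PREP"}
p_morph_given_class = {("Thursday", "DAY"): 0.5, ("Friday", "DAY"): 0.5,
                       ("on", "PREP"): 1.0}
p_class_given_class = {("DAY", "PREP"): 0.8, ("PREP", "PREP"): 0.2}

print(class_bigram_prob("Friday", "on", class_of,
                        p_morph_given_class, p_class_given_class))
print(num_class_model_parameters(V=3, C=2, n=2), "versus", 3 ** 2 - 1)
```

Because Thursday and Friday share a class, they receive the same class-transition probability after on and differ only through their within-class probabilities.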






7 CLASSES 

The inventions described in this specification employ
classes of morphs or words in a number of places. These
include the class language model used by the decoder 702
and described in Section 14, and some embodiments of the
target structure language model 705.

The inventors have devised a number of methods for
automatically partitioning a vocabulary into classes based
upon frequency or cooccurrence statistics or other information
extracted from textual corpora or other sources. In this
section, some of these methods will be explained. An
application to construction of syntactic classes of words will
be described. A person skilled in the art can easily adapt the
methods to other situations. For example, the methods can
be used to construct classes of morphs instead of classes of
words. Similarly, they can be used to construct classes based
upon cooccurrence statistics or statistics of word alignments
in bilingual corpora.






7.1 MAXIMUM MUTUAL INFORMATION CLUS- 
TERING 

A general scheme for clustering a vocabulary into classes
is depicted schematically in FIG. 31. It takes as input a
desired number of classes C 3101, a vocabulary 3102 of size
V, and a model 3103 for a probability distribution P(w_1, w_2)
over bigrams from the vocabulary. It produces as output a
partition 3104 of the vocabulary into C classes. In one
application, the model 3103 can be a 2-gram language model
as described in Section 6, in which case P(w_1, w_2) would be
proportional to the number of times that the bigram w_1 w_2
appears in a large corpus of training text.

Let the score ψ(C) of a partition C be the average mutual
information between the classes of C with respect to the
probability distribution P(w_1, w_2):

    \psi(C) = \sum_{c_1, c_2} P(c_1, c_2) \log \frac{P(c_1, c_2)}{P(c_1) P(c_2)}    (10)

In this sum, c_1 and c_2 each run over the classes of the
partition C, and

    P(c_1) = \sum_{c_2} P(c_1, c_2)    (11)

    P(c_2) = \sum_{c_1} P(c_1, c_2)    (12)

    P(c_1, c_2) = \sum_{w_1 \in c_1, w_2 \in c_2} P(w_1, w_2)    (13)

The scheme of FIG. 31 chooses a partition C for which the
average mutual information score is large.

7.2 A CLUSTERING METHOD 

One method 3204 for carrying out this scheme is depicted
in FIG. 32. The method proceeds iteratively. It begins (step
3203) with a partition of size V in which each word is
assigned to a distinct class. At each stage of the iteration
(steps 3201 and 3202), the current partition is replaced by
a new partition which is obtained by merging a pair of
classes into a single class. The pair of classes to be merged
is chosen so that the score of the new partition is as large as
possible. The method terminates after V-C iterations, at
which point the current partition contains C classes.
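
As an illustration only, the scheme of FIG. 31 and the merging loop of method 3204 can be sketched in a few lines of Python. The function names, the toy bigram counts and the choice to rescore every candidate merge from scratch are assumptions made for this sketch; it is the straightforward, unoptimized form of the computation, not the more frugal implementation discussed in the remainder of this section.

```python
import math
from collections import defaultdict
from itertools import combinations


def average_mutual_information(bigram_counts, word_class):
    """Score of a partition: average mutual information between adjacent classes."""
    total = sum(bigram_counts.values())
    joint = defaultdict(float)   # P(c1, c2), induced from the word bigram counts
    for (w1, w2), n in bigram_counts.items():
        joint[(word_class[w1], word_class[w2])] += n / total
    left = defaultdict(float)    # P(c1): sum over c2
    right = defaultdict(float)   # P(c2): sum over c1
    for (c1, c2), p in joint.items():
        left[c1] += p
        right[c2] += p
    return sum(p * math.log(p / (left[c1] * right[c2]))
               for (c1, c2), p in joint.items() if p > 0)


def greedy_merge(bigram_counts, vocabulary, num_classes):
    """Start with one class per word; repeatedly apply the best-scoring merge."""
    word_class = {w: i for i, w in enumerate(vocabulary)}
    while len(set(word_class.values())) > num_classes:
        best_score, best_partition = None, None
        for a, b in combinations(sorted(set(word_class.values())), 2):
            # Candidate partition in which class b is absorbed into class a.
            trial = {w: (a if c == b else c) for w, c in word_class.items()}
            score = average_mutual_information(bigram_counts, trial)
            if best_score is None or score > best_score:
                best_score, best_partition = score, trial
        word_class = best_partition
    return word_class


# Hypothetical toy corpus statistics: "cat" and "dog" occur in identical contexts.
counts = {("the", "cat"): 4, ("the", "dog"): 4,
          ("cat", "ran"): 4, ("dog", "ran"): 4, ("ran", "the"): 8}
classes = greedy_merge(counts, ["the", "cat", "dog", "ran"], 3)
print(classes["cat"] == classes["dog"])
```

In this toy example merging cat with dog loses no mutual information, so that pair is merged first. Because every candidate is rescored in full, the sketch is practical only for very small vocabularies.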

In order that it be practical, the method 3204 must be
implemented carefully. At the i-th iteration, a pair of classes
to be merged must be selected from amongst approximately
(V-i)^2/2 pairs. The score of the partition obtained by merging
any particular pair is the sum of (V-i)^2 terms, each of
which involves a logarithm. Since altogether there are V-C
merges, this straightforward approach to the computation is
of order V^5. This is infeasible, except for very small values
of V. A more frugal organization of the computation must
take advantage of the redundancy in this straightforward
calculation.

An implementation will now be described in which the
method 3204 executes in time of order V^3. In this
implementation, the change in score due to a merge is computed
in constant time, independent of V.

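Let C_k denote the partition after V-k merges. Let C_k(1),
C_k(2), ..., C_k(k) denote the k classes of C_k. Let p_k(l,m) =
P(C_k(l), C_k(m)) and let

    pl_k(l) = \sum_m p_k(l, m)    (14)

    pr_k(m) = \sum_l p_k(l, m)    (15)

    q_k(l, m) = p_k(l, m) \log \frac{p_k(l, m)}{pl_k(l) \, pr_k(m)}    (16)

Let I_k = ψ(C_k) be the score of C_k, so that

    I_k = \sum_{l, m} q_k(l, m)    (17)

Let I_k(i,j) be the score of the partition obtained from C_k by
merging classes C_k(i) and C_k(j), and let L_k(i,j) = I_k - I_k(i,j) be
the change in score as a result of this merge. Then

    L_k(i, j) = s_k(i) + s_k(j) - q_k(i, j) - q_k(j, i)
                - \sum_{l \ne i, j} q_k(l, i \cup j) - \sum_{m \ne i, j} q_k(i \cup j, m) - q_k(i \cup j, i \cup j)    (18)

where

    s_k(i) = \sum_l q_k(l, i) + \sum_m q_k(i, m) - q_k(i, i)    (19)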


In these and subsequent formulae, i ∪ j denotes the result
of the merge, so that, for example,

    p_k(i \cup j, m) = p_k(i, m) + p_k(j, m)    (20)

    q_k(i \cup j, m) = p_k(i \cup j, m) \log \frac{p_k(i \cup j, m)}{pl_k(i \cup j) \, pr_k(m)}    (21)

The key to the implementation is to store and inductively
update the quantities

    p_k(l, m),  pl_k(l),  pr_k(m),  q_k(l, m),  I_k,  s_k(i),  L_k(i, j)    (22)

Note that if I_k, s_k(i), and s_k(j) are known, then the majority
of the time involved in computing I_k(i,j) is devoted to
computing the sums on the second line of equation 18. Each
of these sums has approximately V-k terms and so this
reduces the problem of evaluating I_k(i,j) from one of order
V^2 to one of order V.

Suppose that the quantities shown in Equation 22 are
known at the beginning of an iteration. Then the new
partition C_{k-1} is obtained by merging the pair of classes C_k(i)
and C_k(j), i<j, for which L_k(i,j) is smallest. The k-1 classes
of the new partition are C_{k-1}(1), C_{k-1}(2), ..., C_{k-1}(k-1) with

    C_{k-1}(i) = C_k(i) \cup C_k(j)

Obviously, I_{k-1} = I_k(i,j). The values of p_{k-1}, pl_{k-1}, pr_{k-1},
and q_{k-1} can be obtained easily from p_k, pl_k, pr_k, and q_k. If
l and m denote indices neither of which is equal to either i
or j, then

    s_{k-1}(l) = s_k(l) - q_k(l, i) - q_k(i, l) - q_k(l, j) - q_k(j, l)
                 + q_{k-1}(l, i) + q_{k-1}(i, l)

    L_{k-1}(l, m) = L_k(l, m) - q_k(l \cup m, i) - q_k(i, l \cup m) - q_k(l \cup m, j) - q_k(j, l \cup m)
                    + q_{k-1}(l \cup m, i) + q_{k-1}(i, l \cup m)    (23)

Finally, s_{k-1}(i) and L_{k-1}(l,i) are determined from
equations 18 and 19.

This update process requires order V^2 computations.
Thus, by this implementation, each iteration of the method
requires order V^2 time, and the complete method requires
order V^3 time.

The implementation can be improved further by keeping
track of those pairs l,m for which p_k(l,m) is different from
zero. For example, suppose that P is given by a simple
bigram model trained on the data described in Table 3 of
Section 6. In this case, of the 6.799×10^10 possible word
2-grams w_1, w_2, only 14,494,217 have non-zero probability.
Thus, in this case, the sums required in equation 18 have, on
average, only about 56 non-zero terms instead of 260,741 as
might be expected from the size of the vocabulary.

7.3 AN ALTERNATE CLUSTERING METHOD 

For very large vocabularies, the method 3204 may be too
computationally costly. The following alternate method can
be used. First, the words of the vocabulary are arranged
in order of frequency with the most frequent words first.
Each of the first C words is assigned to its own distinct class.
The method then proceeds iteratively through V-C steps. At
the k-th step the (C+k)-th most probable word is assigned to a
new class. Then, two of the resulting C+1 classes are merged
into a single class. The pair of classes that is merged is the
one for which the loss in average mutual information is least.
After V-C steps, each of the words in the vocabulary will
have been assigned to one of C classes.

7.4 IMPROVING CLASSES 

The classes constructed by the clustering method 3204 or 
the alternate clustering method described above can often be 
improved. One method of doing this is to repeatedly cycle 
through the vocabulary, moving each word to the class for 
which the resulting partition has the highest average mutual 
information score. Eventually, no word will move and the 
method finishes. It may be possible to further improve the 
classes by simultaneously moving two or more words, but 
for large vocabularies, such a search is too costly to be 
feasible. 

7.5 EXAMPLES 

The methods described above were used to divide the
260,741-word vocabulary of Table 3, Section 6, into 1000
classes. Table 4 shows some of the classes that are particularly
interesting, and Table 5 shows classes that were
selected at random. Each of the lines in the tables contains
members of a different class. The average class has 260
words. The table shows only those words that occur at least
ten times, and only the ten most frequent words of any class.
(The other two months would appear with the class of
months if this limit had been extended to twelve). The
degree to which the classes capture both syntactic and
semantic aspects of English is quite surprising given that
they were constructed from nothing more than counts of
bigrams. The class {that tha theat} is interesting because
although tha and theat are not English words, the method has
discovered that in the training data each of them is most
often a mistyped that.

7.6 A METHOD FOR CONSTRUCTING SIMILARITY 
TREES 

The clustering method 3204 can also be used to construct
a similarity tree over the vocabulary. Suppose the merging
steps 3201 and 3202 of method 3204 are iterated V-1 times,
resulting in a single class consisting of the entire vocabulary.
The order in which the classes are merged determines a
binary tree, the root of which corresponds to this single class
and the leaves of which correspond to the words in the
vocabulary. Intermediate nodes of the tree correspond to
groupings of words intermediate between single words and
the entire vocabulary. Words that are statistically similar
with respect to the model P(w_1, w_2) will be close together in
the tree.

FIG. 30 shows some of the substructures in a tree constructed
in this manner using a simple 2-gram model for the
1000 most frequent words in a collection of office
correspondence.

8 OVERVIEW OF TRANSLATION MODELS
AND PARAMETER ESTIMATION

This section is an introduction to translation models and
methods for estimating the parameters of such models. A
more detailed discussion of these topics will be given in
Section 9.



TABLE 4

Classes from a 260,741-word vocabulary

Friday Monday Thursday Wednesday Tuesday Saturday Sunday weekends Sundays Saturdays
June March July April January December October November September August
people guys folks fellows CEOs chaps doubters commies unfortunates blokes
down backwards ashore sideways southward northward overboard aloft downwards adrift
water gas coal liquid acid sand carbon steam shale iron
great big vast sudden mere sheer gigantic lifelong scant colossal
man woman boy girl lawyer doctor guy farmer teacher citizen
American Indian European Japanese German African Catholic Israeli Italian Arab
pressure temperature permeability density porosity stress velocity viscosity gravity tension
mother wife father son husband brother daughter sister boss uncle
machine device controller processor CPU printer spindle subsystem compiler plotter
John George James Bob Robert Paul William Jim David Mike
anyone someone anybody somebody
feet miles pounds degrees inches barrels tons acres meters bytes
director chief professor commissioner commander treasurer founder superintendent dean custodian
liberal conservative parliamentary royal progressive Tory provisional separatist federalist PQ
had hadn't hath would've could've should've must've might've
asking telling wondering instructing informing kidding reminding bothering thanking deposing
that tha theat
head body hands eyes voice arm seat eye hair mouth

TABLE 5-continued

Randomly selected word classes

court judge jury slam Edelstein magistrate marshal Abella Scalia larceny
annual regular monthly daily weekly quarterly periodic Good yearly convertible
aware unaware unsure cognizant apprised mindful partakers
force ethic stoppage force's conditioner stoppages conditioners waybill forwarder Atonabee
systems magnetics loggers products' coupler Ecoo databanks Centre inscriber correctors
industry producers makers fishery Arabia growers addiction medalist inhalation addict
brought moved opened picked caught tied gathered cleared hung lifted



8.1 TRANSLATION MODELS

As illustrated in FIG. 21, a target structure to source
structure translation model P_θ 706 with parameters θ is a
method for calculating a conditional probability, or
likelihood, P_θ(f|e), for any source structure f given any target
structure e. Examples of such structures include, but are not
limited to, sequences of words, sequences of linguistic
morphs, parse trees, and case frames. The probabilities
satisfy:

    P_\theta(f|e) \ge 0,  P_\theta(failure|e) \ge 0,
    P_\theta(failure|e) + \sum_f P_\theta(f|e) = 1    (24)

where the sum ranges over all structures f, and failure is a
special symbol. P_θ(f|e) can be interpreted as the probability
that a translator will produce f when given e, and
P_θ(failure|e) can be interpreted as the probability that he will
produce no translation when given e. A model is called
deficient if P_θ(failure|e) is greater than zero for some e.

8.1.1 Training

Before a translation model can be used, values must be
determined for the parameters θ. This process is called
parameter estimation or training.

One training methodology is maximum likelihood training,
in which the parameter values are chosen so as to
maximize the probability that the model assigns to a training
sample consisting of a large number S of translations
(f^(s), e^(s)), s = 1,2,...,S. This is equivalent to maximizing the log
likelihood objective function



TABLE 5

Randomly selected word classes

hole prima moment's trifle tad title minute's tinker's hornet's teammate's
ask remind instruct urge interrupt invite congratulate commend warn applaud
object apologize apologise avow whish
cost expense risk profitability deferral earmarks capstone cardinality mintage reseller
B dept. AA Whitey CL pi Namerow PA Mgr. LaRose
# Rel rei #S Shree
S Gens nm Matsuzawa ow Kageyama Nishida Sunrit Zollner Mallik
research training education science advertising arts medicine machinery Art AIDS
rise focus depend rely concentrate dwell capitalize embark intrude typewriting
Minister mover Sydneys Minster Mimter
running moving playing setting holding carrying passing cutting driving fighting



    \psi(P_\theta) = S^{-1} \sum_{s=1}^{S} \log P_\theta(f^{(s)} | e^{(s)}) = \sum_{f,e} C(f,e) \log P_\theta(f|e)    (25)

Here C(f,e) is the empirical distribution of the sample, so
that C(f,e) is 1/S times the number of times (usually 0 or 1)
that the pair (f,e) occurs in the sample.

8.1.2 Hidden Alignment Models

In some embodiments, translation models are based on
the notion of an alignment between a target structure and a
source structure. An alignment is a set of connections
between the entries of the two structures.

FIG. 22 shows an alignment between the target structure
John V_past_3s to_kiss Mary and the source structure Jean
V_past_3s embrasser Marie. (These structures are transduced
versions of the sentences John kissed Mary and Jean a
embrassé Marie respectively.) In this alignment, the entry
John is connected with the entry Jean, the entry V_past_3s is
connected with the entry V_past_3s, the entry to_kiss is
connected with the entry embrasser, and the entry Mary is
connected with the entry Marie. FIG. 23 shows an alignment
between the target structure John V_past_3s to_kiss the girl
and the source structure Le jeune fille V_passive_3s
embrasser par Jean. (These structures are transduced
versions of the sentences John kissed the girl and Le jeune fille
a été embrassée par Jean respectively.)

In some embodiments, illustrated in FIG. 24, a translation
model 706 computes the probability of a source structure
given a target structure as the sum of the probabilities of all
alignments between these structures:

    P_\theta(f|e) = \sum_a P_\theta(f, a|e)    (26)

In some embodiments, a translation model 706 can compute
the probability of a source structure given a target
structure as the maximum of the probabilities of all
alignments between these structures:

    P_\theta(f|e) = \max_a P_\theta(f, a|e)    (27)


    combinatorial_factor = \prod_{i=0}^{l} \phi_i!    (29)

    fertility_prob = n_0(\phi_0 | l) \prod_{i=1}^{l} n(\phi_i | e_i)    (30)

    lexical_prob = \prod_{j=1}^{m} t(f_j | e_{a_j})    (31)

    distortion_prob = \prod_{j : a_j \ne 0} a(j | a_j, m)    (32)



As depicted in FIG. 25, the probability P_θ(f,a|e) of a single
alignment is computed by a detailed translation model 2101.
The detailed translation model 2101 employs a table 2501 of
values for the parameters θ.

8.2 AN EXAMPLE OF A DETAILED TRANSLATION
MODEL

A simple embodiment of a detailed translation model is
depicted in FIG. 26. This embodiment will now be
described. Other, more sophisticated, embodiments will be
described in Section 9.

For the simple embodiment of a detailed translation
model depicted in FIG. 26, the source and target structures
are sequences of linguistic morphs. This embodiment
comprises three components:

1. a fertility sub-model 2601;

2. a lexical sub-model 2602;

3. a distortion sub-model 2603.

The probability of an alignment and source structure
given a target structure is obtained by combining the
probabilities computed by each of these sub-models.
Corresponding to these sub-models, the table of parameter values
2501 comprises:

1a. fertility probabilities n(φ|e), where φ is any
non-negative integer and e is any target morph;

1b. null fertility probabilities n_0(φ|m'), where φ is any
non-negative integer and m' is any positive integer;

2a. lexical probabilities t(f|e), where f is any source
morph, and e is any target morph;

2b. lexical probabilities t(f|*null*), where f is any source
morph, and *null* is a special symbol;

3. distortion probabilities a(j|i,m), where m is any positive
integer, i is any positive integer, and j is any positive integer
between 1 and m.

This embodiment of the detailed translation model 2101
computes the probability P_θ(f,a|e) of an alignment a and a
source structure f given a target structure e as follows. If any
source entry is connected to more than one target entry, then
the probability is zero. Otherwise, the probability is
computed by the formula

    P_\theta(f, a|e) = combinatorial_factor \times fertility_prob \times lexical_prob \times distortion_prob    (28)

Here

l is the number of morphs in the target structure;

e_i for i=1,2,...,l is the i-th morph of the target structure;

e_0 is the special symbol *null*;

m is the number of morphs in the source structure;

f_j for j=1,2,...,m is the j-th morph of the source structure;

φ_0 is the number of morphs of the target structure that are
not connected with any morphs of the source structure;

φ_i for i=1,2,...,l is the number of morphs of the source
structure that are connected with the i-th morph of the
target structure;

a_j for j=1,2,...,m is 0 if the j-th morph of the source
structure is not connected to any morph of the target
structure;

a_j for j=1,2,...,m is i if the j-th morph of the source
structure is connected to the i-th morph of the target
structure.

These formulae are best illustrated by an example.
Consider the source structure, target structure, and alignment
depicted in FIG. 23. There are 7 source morphs, so m=7. The
first source morph is Le, so f_1=Le; the second source morph
is jeune, so f_2=jeune; etc. There are 5 target morphs so l=5.
The first target morph is John, so e_1=John; the second target
morph is V_past_3s, so e_2=V_past_3s; etc. The first source
morph (Le) is connected to the fourth target morph (the), so
a_1=4. The second source morph (jeune) is connected to the
fifth target morph (girl) so a_2=5. The complete alignment is

    a_1 = 4,  a_2 = 5,  a_3 = 5,  a_4 = 2,  a_5 = 3,  a_6 = 0,  a_7 = 1

All target morphs are connected to at least one source
morph, so φ_0=0. The first target morph (John) is connected
to exactly one source morph (Jean), so φ_1=1. The second
target morph (V_past_3s) is connected to exactly one source
morph (V_passive_3s), so φ_2=1. The complete set of fertilities
is


    φ_0 = 0,  φ_1 = 1,  φ_2 = 1,  φ_3 = 1,  φ_4 = 1,  φ_5 = 2

The components of the probability P_θ(f,a|e) are

    combinatorial_factor = 0! 1! 1! 1! 1! 2!    (note: 0! = 1)

    fertility_prob = n_0(φ_0 | 5)
                     n(1|John) n(1|V_past_3s)
                     n(1|to_kiss) n(1|the) n(2|girl)

    lexical_prob = t(Le|the) t(jeune|girl) t(fille|girl)
                   t(V_passive_3s|V_past_3s) t(embrasser|to_kiss)
                   t(par|*null*) t(Jean|John)

    distortion_prob = a(1|4,7) a(2|5,7) a(3|5,7)
                      a(4|2,7) a(5|3,7) a(7|1,7)
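
The bookkeeping in this example can be checked mechanically. The following Python sketch is illustrative only: the function names are hypothetical, φ_0 is computed according to the definition given in the text, and no parameter values are assumed, so the sketch lists the arguments of each factor rather than evaluating a probability.

```python
from collections import Counter
from math import factorial

# Structures and alignment of FIG. 23 as described in the text.
target = ["John", "V_past_3s", "to_kiss", "the", "girl"]          # e_1 .. e_l
source = ["Le", "jeune", "fille", "V_passive_3s",
          "embrasser", "par", "Jean"]                             # f_1 .. f_m
alignment = [4, 5, 5, 2, 3, 0, 1]                                 # a_1 .. a_m, 0 = unconnected


def fertilities(alignment, l):
    """phi_i for i = 1..l: number of source morphs connected to target morph i."""
    connected = Counter(a for a in alignment if a != 0)
    return [connected[i] for i in range(1, l + 1)]


def combinatorial_factor(phis):
    """Product of phi_i! over the given fertilities."""
    result = 1
    for phi in phis:
        result *= factorial(phi)
    return result


def lexical_terms(source, target, alignment):
    """Pairs (f_j, e_{a_j}); an unconnected source morph is paired with *null*."""
    return [(f, "*null*" if a == 0 else target[a - 1])
            for f, a in zip(source, alignment)]


def distortion_terms(alignment):
    """Triples (j, a_j, m) for every connected source position j."""
    m = len(alignment)
    return [(j, a, m) for j, a in enumerate(alignment, start=1) if a != 0]


phi = fertilities(alignment, len(target))
phi_0 = sum(1 for value in phi if value == 0)   # unconnected target morphs, per the text
print(phi_0, phi)
print(combinatorial_factor([phi_0] + phi))
print(lexical_terms(source, target, alignment))
print(distortion_terms(alignment))
```

The lists returned by lexical_terms and distortion_terms correspond one-to-one to the t(.|.) and a(.|.,.) factors written out above.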



Many other embodiments of the detailed translation model 2101
are possible. Five different embodiments will be described in
Section 9 and Section 10.

8.3 ITERATIVE IMPROVEMENT

It is not possible to create a sophisticated model or find
good parameter values at a stroke. One aspect of the present
invention is a method of model creation and parameter
estimation that involves gradual iterative improvement. For
a given model, current parameter values are used to find
better ones. In this way, starting from initial parameter
values, locally optimal values are obtained. Then, good
parameter values for one model are used to find initial
parameter values for another model. By alternating between
these two steps, the method proceeds through a sequence of
gradually more sophisticated models.

As illustrated in FIG. 27, the method comprises the 
following steps: 

2701. Start with a list of models Model 0, Model 1, Model 
2 etc. 

2703. Choose parameter values for Model 0. 

2702. Let i=0. 

2707. Repeat steps 2704, 2705 and 2706 until all models 
are processed. 

2704. Increment i by 1. 

2705. Use Models 0 to i-1 together with their parameter 
values to find initial parameter values for Model i. 

2706. Use the initial parameter values for Model i, to find 
locally optimal parameter values for Model i. 



8.3.1 Relative Objective Function

Two hidden alignment models P̃_θ̃ and P_θ of the form
depicted in FIG. 24 can be compared using the relative
objective function

    R(\tilde{P}_{\tilde\theta}, P_\theta) \equiv \sum_{f,e} C(f,e) \sum_a \tilde{P}_{\tilde\theta}(a|f,e) \log \frac{P_\theta(f,a|e)}{\tilde{P}_{\tilde\theta}(f,a|e)}    (34)

where P̃_θ̃(a|f,e) = P̃_θ̃(f,a|e) / P̃_θ̃(f|e). Note that
R(P̃_θ̃, P̃_θ̃) = 0. R is related to ψ by Jensen's inequality

    \psi(P_\theta) \ge \psi(\tilde{P}_{\tilde\theta}) + R(\tilde{P}_{\tilde\theta}, P_\theta)    (35)

which follows because the logarithm is concave. In fact, for
any e and f,

    \sum_a \tilde{P}_{\tilde\theta}(a|f,e) \log \frac{P_\theta(f,a|e)}{\tilde{P}_{\tilde\theta}(f,a|e)}
        \le \log \sum_a \tilde{P}_{\tilde\theta}(a|f,e) \frac{P_\theta(f,a|e)}{\tilde{P}_{\tilde\theta}(f,a|e)}
        = \log \frac{P_\theta(f|e)}{\tilde{P}_{\tilde\theta}(f|e)} = \log P_\theta(f|e) - \log \tilde{P}_{\tilde\theta}(f|e)    (36)

Averaging (36) over the pairs (f,e) with weights C(f,e) gives (35).

R is similar in form to the relative entropy

    D(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}    (37)

between probability distributions p and q. However, whereas
the relative entropy is never negative, R can take any value.
The inequality (35) for R is the analog of the inequality
D ≥ 0 for D.

8.3.2 Improving Parameter Values

From Jensen's inequality (35), it follows that ψ(P_θ) is
greater than ψ(P̃_θ̃) if R(P̃_θ̃, P_θ) is positive. With P̃=P, this
suggests the following iterative procedure, depicted in FIG.
28 and known as the EM Method, for finding locally optimal
parameter values θ for a model P:

2801. Choose initial parameter values θ̃.

2803. Repeat steps 2802-2805 until convergence.

2802. With θ̃ fixed, find the values θ which maximize
R(P_θ̃, P_θ).

2805. Replace θ̃ by θ.

2804. After convergence, θ̃ will be locally optimal
parameter values for model P.

Note that for any θ̃, R(P_θ̃, P_θ) is non-negative at its
maximum in θ, since it is zero for θ=θ̃. Thus ψ(P_θ) will not
decrease from one iteration to the next.

8.3.3 Going from One Model to Another

Jensen's inequality also suggests a method for using
parameter values θ̃ for one model P̃ to find initial parameter
values θ for another model P:

2804. Start with parameter values θ̃ for P̃.

2901. With P̃ and θ̃ fixed, find the values θ which
maximize R(P̃_θ̃, P_θ).

2902. θ will be good initial values for model P.

In contrast to the case where P̃=P, there may not be any
θ for which R(P̃_θ̃, P_θ) is non-negative. Thus, it could be that,
even for the best θ, ψ(P_θ) < ψ(P̃_θ̃).

8.3.4 Parameter Reestimation Formulae

To apply these procedures, it is necessary to solve the
maximization problems of Steps 2802 and 2901. For the
models described below, this can be done explicitly. To see
the basic form of the solution, suppose P_θ is a simple model
given by

    P_\theta(f, a|e) = \prod_{\omega \in \Omega} \theta(\omega)^{c(\omega; a,f,e)}    (38)

where the θ(ω), ω∈Ω, are real-valued parameters satisfying
the constraints

    \theta(\omega) \ge 0,  \sum_{\omega \in \Omega} \theta(\omega) = 1    (39)

and for each ω and (a,f,e), c(ω;a,f,e) is a non-negative
integer. (2) It is natural to interpret θ(ω) as the probability of
the event ω and c(ω;a,f,e) as the number of times that this
event occurs in (a,f,e).

(2) More generally, we can allow c(ω;a,f,e) to be a non-negative real number.

Note that



    c(\omega; a,f,e) = \theta(\omega) \frac{\partial}{\partial \theta(\omega)} \log P_\theta(f, a|e)    (40)

The values for θ that maximize the relative objective
function R(P̃_θ̃, P_θ) subject to the constraints (39) are
determined by the Kuhn-Tucker conditions

    \frac{\partial}{\partial \theta(\omega)} R(\tilde{P}_{\tilde\theta}, P_\theta) - \lambda = 0,  \omega \in \Omega    (41)

where λ is a Lagrange multiplier, the value of which is
determined by the equality constraint in Equation (39).
These conditions are both necessary and sufficient for a
maximum since R(P̃_θ̃, P_θ) is a concave function of the θ(ω).
Multiplying Equation (41) by θ(ω) and using Equation (40)
and Definition (34) of R, yields the parameter reestimation
formulae

    \theta(\omega) = \lambda^{-1} c_{\tilde\theta}(\omega),  \lambda = \sum_{\omega \in \Omega} c_{\tilde\theta}(\omega)    (42)

    c_{\tilde\theta}(\omega) = \sum_{f,e} C(f,e) \, c_{\tilde\theta}(\omega; f,e)    (43)

    c_{\tilde\theta}(\omega; f,e) = \sum_a \tilde{P}_{\tilde\theta}(a|f,e) \, c(\omega; a,f,e)    (44)

c_θ̃(ω;f,e) can be interpreted as the expected number of
times, as computed by the model P̃_θ̃, that the event ω occurs
in the translation of e to f. Thus θ(ω) is the (normalized)
expected number of times, as computed by model P̃_θ̃, that ω
occurs in the translations of the training sample.

These formulae can easily be generalized to models of the
form (38) for which the single equality constraint in
Equation (39) is replaced by multiple constraints

    \sum_{\omega \in \Omega_\mu} \theta(\omega) = 1,  \mu = 1, 2, ...    (45)

where the subsets Ω_μ, μ=1,2,..., form a partition of Ω. Only
Equation (42) needs to be modified to include a different λ_μ
for each μ; if ω∈Ω_μ, then

    \theta(\omega) = \lambda_\mu^{-1} c_{\tilde\theta}(\omega),  \lambda_\mu = \sum_{\omega \in \Omega_\mu} c_{\tilde\theta}(\omega)    (46)

8.3.5 Approximate and Viterbi Parameter
Estimation

The computation required for the evaluation of the counts
c in Equation 44 increases with the complexity of the model
whose parameters are being determined.

For simple models, such as Model 1 and Model 2
described below, it is possible to calculate these counts
exactly by including the contribution of each possible
alignment. For more sophisticated models, such as Model 3,
Model 4, and Model 5 described below, the sum over
alignments in Equation 44 is too costly to compute exactly.
Rather, only the contributions from a subset of alignments
can be practically included. If the alignments in this subset
account for most of the probability of a translation, then this
truncated sum can still be a good approximation.

This suggests the following modification to Step 2802 of
the iterative procedure of FIG. 28:

• In calculating the counts using the update formulae 44,
approximate the sum by including only the contributions
from some subset of alignments of high probability.

If only the contribution from the single most probable
alignment is included, the resulting procedure is called
Viterbi Parameter Estimation. The most probable alignment
between a target structure and a source structure is called the
Viterbi alignment. The convergence of Viterbi Estimation is
easily demonstrated. At each iteration, the parameter values
are re-estimated so as to make the current set of Viterbi
alignments as probable as possible; when these parameters
are used to compute a new set of Viterbi alignments, either
the old set is recovered or a set which is yet more probable
is found. Since the probability can never be greater than one,
this process surely converges. In fact, it converges in a finite,
though very large, number of steps because there are only a
finite number of possible alignments for any particular
translation.

In practice, it is often not possible to find Viterbi
alignments exactly. Instead, an approximate Viterbi alignment
can be found using a practical search procedure.
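
The re-estimation formulae (42) through (44), together with the truncation to a subset of high-probability alignments, can be sketched as follows. This is a minimal illustration under stated assumptions: the input layout (a weight C(f,e) per training pair and, for each alignment, its posterior probability and event counts), the function name reestimate and the parameter top_k are all hypothetical, and only the single-constraint case of Equation (39) is handled.

```python
from collections import defaultdict


def reestimate(samples, top_k=None):
    """Return theta(omega) as normalized expected event counts.

    samples: list of (weight, alignments), where weight plays the role of
        C(f, e) and alignments is a list of (posterior, event_counts) with
        posterior = P(a | f, e) under the current parameters and
        event_counts = {omega: c(omega; a, f, e)}.
    top_k: if given, keep only the top_k most probable alignments of each
        pair (top_k=1 keeps only the most probable alignment).
    """
    expected = defaultdict(float)
    for weight, alignments in samples:
        ranked = sorted(alignments, key=lambda item: item[0], reverse=True)
        if top_k is not None:
            ranked = ranked[:top_k]          # truncated sum over alignments
        for posterior, event_counts in ranked:
            for event, count in event_counts.items():
                expected[event] += weight * posterior * count   # Equations 43 and 44
    norm = sum(expected.values())            # the multiplier lambda of Equation 42
    return {event: value / norm for event, value in expected.items()}


# Hypothetical toy sample: one training pair with two candidate alignments.
samples = [(1.0, [(0.75, {"x": 2}), (0.25, {"x": 1, "y": 1})])]
theta_full = reestimate(samples)
theta_best = reestimate(samples, top_k=1)
print(theta_full, theta_best)
```

With the full sum, the expected counts are 1.75 for x and 0.25 for y. With top_k=1 only the first alignment contributes, so all of the probability mass goes to x.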



8.4 Five Models

Five detailed translation models of increasing sophistication
will be described in Section 9 and Section 10. These
models will be referred to as Model 1, Model 2, Model 3,
Model 4, and Model 5.

Model 1 is very simple but it is useful because its
likelihood function is concave and consequently has a global
maximum which can be found by the EM procedure. Model
2 is a slight generalization of Model 1. For both Model 1 and
Model 2, the sum over alignments for the objective function
and the relative objective function can be computed very
efficiently. This significantly reduces the computational
complexity of training. Model 3 is more complicated and is
designed to more accurately model the relation between a
morph of e and the set of morphs in f to which it is
connected. Model 4 is a more sophisticated step in this
direction. Both Model 3 and Model 4 are deficient. Model 5
is a generalization of Model 4 in which this deficiency is
removed at the expense of increased complexity. For Models 3, 4
and 5 the exact sum over alignments cannot be computed
efficiently. Instead, this sum can be approximated by restricting
it to alignments of high probability.

Model 5 is a preferred embodiment of a detailed translation
model. It is a powerful but unwieldy ally in the battle
to describe translations. It must be led to the battlefield by
its weaker but more agile brethren Models 2, 3, and 4. In
fact, this is the raison d'etre of these models.

9 DETAILED DESCRIPTION OF 
TRANSLATION MODELS AND PARAMETER 
ESTIMATION 

In this section, embodiments of the statistical translation
model that assign a conditional probability to the event that
a sequence of lexical units in the source language is a
translation of a sequence of lexical units in the target
language will be described. Methods for estimating the
parameters of these embodiments will be explained.

For concreteness the discussion will be phrased in terms
of a source language of French and a target language of
English. The sequences of lexical units will be restricted to
sequences of words.

It should be understood that the translation models, and
the methods for their training, generalize easily to other
source and target languages, and to sequences of lexical
units comprising lexical morphemes, syntactic markers,
sense labels, case frame labels, etc.

9.1 Notation

Random variables will be denoted by upper case letters,
and the values of such variables will be denoted by the
corresponding lower case letters. For random variables X
and Y, the probability Pr(Y=y|X=x) will be denoted by the
shorthand P(y|x). Strings or vectors will be denoted by bold
face letters, and their entries will be denoted by the
corresponding non-bold letters.

In particular, e will denote an English word, bold e will denote
a string of English words, and E will denote a random
variable that takes as values strings of English words. The
length of an English string will be denoted l, and the
corresponding random variable will be denoted L. Similarly,
f will denote a French word, bold f will denote a string of French
words, and F will denote a random variable that takes as
values strings of French words. The length of a French string
will be denoted m, and the corresponding random variable
will be denoted M.

A table of notation appears in Section 10.

9.2 Translations

A pair of strings that are translations of one another will
be called a translation. A translation will be depicted by
enclosing the strings that make it up in parentheses and
separating them by a vertical bar. Thus, (Qu'aurions-nous pu
faire? | What could we have done?) depicts the translation of
Qu'aurions-nous pu faire? as What could we have done?.
When the strings are sentences, the final stop will be omitted
unless it is a question mark or an exclamation point.

9.3 Alignments

Some embodiments of the translation model involve the
idea of an alignment between a pair of strings. An alignment
consists of a set of connections between words in the French
string and words in the English string. An alignment is a
random variable, A; a generic value of this variable will be
denoted by a. Alignments are shown graphically, as in FIG.
33, by writing the English string above the French string and
drawing lines from some of the words in the English string
to some of the words in the French string. These lines are
called connections. The alignment in FIG. 33 has seven
connections, (the, Le), (program, programme), and so on. In
the description that follows, such an alignment will be denoted
as (Le programme a été mis en application | And the(1)
program(2) has(3) been(4) implemented(5,6,7)). The list of
numbers following an English word shows the positions in
the French string of the words with which it is aligned.
Because And is not aligned with any French words here,
there is no list of numbers after it. Every alignment is
considered to be correct with some probability. Thus (Le
programme a été mis en application | And(1,2,3,4,5,6,7) the
program has been implemented) is perfectly acceptable. Of
course, this is much less probable than the alignment shown
in FIG. 33.

As another example, the alignment (Le reste appartenait
aux autochtones | The(1) balance(2) was(3) the(3) territory(3)
of(4) the(4) aboriginal(5) people(5)) is shown in FIG. 34.
Sometimes, two or more English words may be connected to
two or more French words as shown in FIG. 35. This will be
denoted as (Les pauvres sont démunis | The(1) poor(2)
don't(3,4) have(3,4) any(3,4) money(3,4)). Here, the four
English words don't have any money conspire to generate
the two French words sont démunis.



The set of English words connected to a French word will
be called the notion that generates it. An alignment resolves
the English string into a set of possibly overlapping notions
that is called a notional scheme. The previous example
contains the three notions The, poor, and don't have any
money. Formally, a notion is a subset of the positions in the
English string together with the words occupying those
positions. To avoid confusion in describing such cases, a
subscript will be affixed to each word showing its position.
The alignment in FIG. 34 includes the notions the_4 and
of_6 the_7, but not the notions of_6 the_4 or the_7. In (J'applaudis
à la décision | I(1) applaud(2) the(4) decision(5)), à is
generated by the empty notion. Although the empty notion has
no position and so never requires a clarifying subscript, it
will be placed at the beginning of the English string, in
position zero, and denoted by e_0. At times, therefore, the
previous alignment will be denoted as (J'applaudis à la
décision | e_0(3) I(1) applaud(2) the(4) decision(5)).

The set of all alignments of (f|e) will be written 𝒜(e,f).
Since e has length l and f has length m, there are lm different
connections that can be drawn between them. Since an
alignment is determined by the connections that it contains,
and since a subset of possible connections can be chosen in
2^{lm} ways, there are 2^{lm} alignments in 𝒜(e,f).

For the Models 1-5 described below, only alignments
without multi-word notions are allowed. Assuming this
restriction, a notion can be identified with a position in the
English string, with the empty notion identified with position
zero. Thus, each position in the French string is connected
to one of the positions 0 through l in the English
string. The notation a_j = i will be used to indicate that a word
in position j of the French string is connected to the word in
position i of the English string.
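
The notation above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, an alignment is held as a tuple (a_1, ..., a_m) with a_j in 0..l, and the empty notion is printed as e0. The sketch enumerates the (l+1)^m restricted alignments, gives the 2^{lm} count for unrestricted alignments, and prints an alignment in the bracket notation used in this section.

```python
from itertools import product


def restricted_alignments(l, m):
    """Yield every alignment (a_1, ..., a_m) with each a_j in 0..l.

    a_j = i means French position j is connected to English position i;
    i = 0 stands for the empty notion.
    """
    return product(range(l + 1), repeat=m)


def count_general_alignments(l, m):
    # Any subset of the l*m possible connections is an alignment.
    return 2 ** (l * m)


def to_bracket_notation(english, french, a):
    """Render (french | english) with the French positions after each English word."""
    linked = {i: [] for i in range(len(english) + 1)}
    for j, i in enumerate(a, start=1):
        linked[i].append(j)
    parts = []
    if linked[0]:  # show the empty notion only when something connects to it
        parts.append("e0(" + ",".join(map(str, linked[0])) + ")")
    for i, word in enumerate(english, start=1):
        if linked[i]:
            parts.append(word + "(" + ",".join(map(str, linked[i])) + ")")
        else:
            parts.append(word)
    return "(" + " ".join(french) + " | " + " ".join(parts) + ")"
```

For instance, with english = "And the program has been implemented".split(), french = "Le programme a été mis en application".split(), and a = (2, 3, 4, 5, 6, 6, 6), the last function reproduces the notation given earlier for FIG. 33.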

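The probability of a French string f and an alignment a
given an English string e can be written

$$P(\mathbf{f},\mathbf{a}\mid\mathbf{e}) = P(m\mid\mathbf{e}) \prod_{j=1}^{m} P(a_j\mid a_1^{j-1}, f_1^{j-1}, m, \mathbf{e})\, P(f_j\mid a_1^{j}, f_1^{j-1}, m, \mathbf{e}). \tag{47}$$

Here, m is the length of f and a_1^m is determined by a.
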
9.4 Model 1 

The conditional probabilities on the right-hand side of 
Equation (47) cannot all be taken as independent parameters 
because there are too many of them. In Model 1, the
probabilities P(m|e) are taken to be independent of e and m;
P(a_j | a_1^{j-1}, f_1^{j-1}, m, e) depends only on l, the length of
the English string, and therefore must be (l+1)^{-1}; and
P(f_j | a_1^j, f_1^{j-1}, m, e) depends only on f_j and e_{a_j}. The
parameters, then, are ε = P(m|e), and
t(f_j | e_{a_j}) ≡ P(f_j | a_1^j, f_1^{j-1}, m, e),
which will be called the translation probability of f_j given
e_{a_j}. The parameter ε is fixed at some small number. The
distribution of M is unnormalized but this is a minor 
technical issue of no significance to the computations. In 
particular, M can be restricted to some finite range. As long 
as this range encompasses everything that actually occurs in 
training data, no problems arise. 

A method for estimating the translation probabilities for
Model 1 will now be described. The joint likelihood of a
French sentence and an alignment is

$$P(\mathbf{f},\mathbf{a}\mid\mathbf{e}) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} t(f_j\mid e_{a_j}). \tag{48}$$

The alignment is determined by specifying the values of a_j
for j from 1 to m, each of which can take any value from 0
to l. Therefore,

$$P(\mathbf{f}\mid\mathbf{e}) = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j\mid e_{a_j}). \tag{49}$$

The first goal of the training process is to find the values
of the translation probabilities that maximize P(f|e) subject
to the constraints that for each e,

$$\sum_{f} t(f\mid e) = 1. \tag{50}$$

An iterative method for doing so will be described.

The method is motivated by the following consideration.
Following standard practice for constrained maximization, a
necessary relationship for the parameter values at a local
maximum can be found by introducing Lagrange multipliers,
λ_e, and seeking an unconstrained maximum of the
auxiliary function

$$h(t,\lambda) \equiv \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j\mid e_{a_j}) - \sum_{e} \lambda_e \Big( \sum_{f} t(f\mid e) - 1 \Big). \tag{51}$$

At a local maximum, all of the partial derivatives of h with
respect to the components of t and λ are zero. That the partial
derivatives with respect to the components of λ be zero is
simply a restatement of the constraints on the translation
probabilities. The partial derivative of h with respect to t(f|e)
is

$$\frac{\partial h}{\partial t(f\mid e)} = \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \sum_{j=1}^{m} \delta(f,f_j)\, \delta(e,e_{a_j})\, t(f\mid e)^{-1} \prod_{k=1}^{m} t(f_k\mid e_{a_k}) - \lambda_e, \tag{52}$$

where δ is the Kronecker delta function, equal to one when
both of its arguments are the same and equal to zero
otherwise. This will be zero provided that

$$t(f\mid e) = \lambda_e^{-1} \frac{\epsilon}{(l+1)^m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \sum_{j=1}^{m} \delta(f,f_j)\, \delta(e,e_{a_j}) \prod_{k=1}^{m} t(f_k\mid e_{a_k}). \tag{53}$$

Superficially, Equation (53) looks like a solution to the
maximization problem, but it is not because the translation
probabilities appear on both sides of the equal sign.
Nonetheless, it suggests an iterative procedure for finding a
solution:

1. Begin with an initial guess for the translation probabilities;

2. Evaluate the right-hand side of Equation (53);

3. Use the result as a new estimate for t(f|e);

4. Repeat steps 2 and 3 until the estimates converge.

(Here and elsewhere, the Lagrange multipliers simply serve
as a reminder that the translation probabilities must be
normalized so that they satisfy Equation (50).) This process,
when applied repeatedly, is called the EM process. That it
converges to a stationary point of h in situations like this, as
demonstrated in the previous section, was first shown by L.
E. Baum in an article entitled, An Inequality and Associated
Maximization Technique in Statistical Estimation of
Probabilistic Functions of a Markov Process, appearing in the
journal Inequalities, Vol. 3, in 1972.

With the aid of Equation (48), Equation (53) can be
re-expressed as

$$t(f\mid e) = \lambda_e^{-1} \sum_{\mathbf{a}} P(\mathbf{f},\mathbf{a}\mid\mathbf{e}) \underbrace{\sum_{j=1}^{m} \delta(f,f_j)\, \delta(e,e_{a_j})}_{\text{number of times } e \text{ connects to } f \text{ in } \mathbf{a}}. \tag{54}$$

The expected number of times that e connects to f in the
translation (f|e) will be called the count of f given e for (f|e)
and will be denoted by c(f|e; f, e). By definition,

$$c(f\mid e;\mathbf{f},\mathbf{e}) = \sum_{\mathbf{a}} P(\mathbf{a}\mid\mathbf{e},\mathbf{f}) \sum_{j=1}^{m} \delta(f,f_j)\, \delta(e,e_{a_j}), \tag{55}$$

where P(a|e,f) = P(f,a|e)/P(f|e). If λ_e is replaced by λ_e P(f|e),
then Equation (54) can be written very compactly as

$$t(f\mid e) = \lambda_e^{-1}\, c(f\mid e;\mathbf{f},\mathbf{e}). \tag{56}$$

In practice, the training data consists of a set of translations,
(f^(1)|e^(1)), (f^(2)|e^(2)), ..., (f^(S)|e^(S)), so this equation becomes

$$t(f\mid e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f\mid e;\mathbf{f}^{(s)},\mathbf{e}^{(s)}). \tag{57}$$

Here, λ_e serves only as a reminder that the translation
probabilities must be normalized.



Usually, it is not feasible to evaluate the expectation in
Equation (55) exactly. Even if multiword notions are
excluded, there are still (l+1)^m alignments possible for (f|e).
Model 1, however, is special because by recasting Equation
(49), it is possible to obtain an expression that can be
evaluated efficiently. The right-hand side of Equation (49) is
a sum of terms each of which is a monomial in the translation
probabilities. Each monomial contains m translation
probabilities, one for each of the words in f. Different monomials
correspond to different ways of connecting words in f to
notions in e with every way appearing exactly once. By
direct evaluation, then



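$$\sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j\mid e_{a_j}) = \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j\mid e_i). \tag{58}$$

Therefore, the sums in Equation (49) can be interchanged
with the product to obtain

$$P(\mathbf{f}\mid\mathbf{e}) = \frac{\epsilon}{(l+1)^m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j\mid e_i). \tag{59}$$

Using this expression, it follows that

$$c(f\mid e;\mathbf{f},\mathbf{e}) = \frac{t(f\mid e)}{t(f\mid e_0) + \cdots + t(f\mid e_l)}\; \underbrace{\sum_{j=1}^{m} \delta(f,f_j)}_{\text{count of } f \text{ in } \mathbf{f}}\; \underbrace{\sum_{i=0}^{l} \delta(e,e_i)}_{\text{count of } e \text{ in } \mathbf{e}}. \tag{60}$$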



Thus, the number of operations necessary to calculate a
count is proportional to lm rather than to (l+1)^m as Equation
(55) might suggest.

The details of the initial guesses for t(f|e) are unimportant
because P(f|e) has a unique local maximum for Model 1, as
is shown in Section 10. In practice, the initial probabilities
t(f|e) are chosen to be equal, but any other choice that avoids
zeros would lead to the same final solution.
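
The Model 1 training loop described above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name, the dictionary layout t[(f, e)], the use of None for the empty notion e_0, and the fixed iteration count are all assumptions. Counts are accumulated token by token, which yields the closed form of Equation (60), and the normalization of Equation (57) is applied once per iteration.

```python
from collections import defaultdict


def model1_em(corpus, iterations=10):
    """Estimate t(f|e) for Model 1.

    corpus: list of (french_words, english_words) pairs.
    Returns a dict-like table t[(f, e)]; e is None for the empty notion.
    """
    f_vocab = {f for fs, _ in corpus for f in fs}
    # Equal, nonzero initial guesses, as the text suggests.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        counts = defaultdict(float)   # c(f|e) summed over the corpus
        totals = defaultdict(float)   # normalizer playing the role of lambda_e
        for fs, es in corpus:
            es0 = [None] + list(es)   # position 0 holds the empty notion
            for f in fs:
                # Denominator of Equation (60): t(f|e_0) + ... + t(f|e_l)
                denom = sum(t[(f, e)] for e in es0)
                for e in es0:
                    c = t[(f, e)] / denom
                    counts[(f, e)] += c
                    totals[e] += c
        # Equation (57): new t(f|e) is the normalized count
        t = defaultdict(float, {(f, e): c / totals[e]
                                for (f, e), c in counts.items()})
    return t
```

Each pass costs on the order of lm operations per sentence pair, in line with the remark that follows Equation (60).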



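9.5 Model 2

Model 1 takes no cognizance of where words appear in
either string. The first word in the French string is just as
likely to be connected to a word at the end of the English
string as to one at the beginning. In contrast, Model 2 makes
the same assumptions as in Model 1 except that it assumes
that P(a_j | a_1^{j-1}, f_1^{j-1}, m, e) depends on j, a_j, and m, as well as
on l. This is done using a set of alignment probabilities,

$$a(a_j\mid j,m,l) \equiv P(a_j\mid a_1^{j-1}, f_1^{j-1}, m, \mathbf{e}), \tag{61}$$

which satisfy the constraints

$$\sum_{i=0}^{l} a(i\mid j,m,l) = 1 \tag{62}$$

for each triple jml. Equation (49) is replaced by

$$P(\mathbf{f}\mid\mathbf{e}) = \epsilon \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j\mid e_{a_j})\, a(a_j\mid j,m,l). \tag{63}$$

A relationship among the parameter values for Model 2
that maximize this likelihood is obtained, in analogy with
the above discussion, by seeking an unconstrained maximum
of the auxiliary function

$$h(t,a,\lambda,\mu) \equiv \epsilon \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j\mid e_{a_j})\, a(a_j\mid j,m,l) - \sum_{e} \lambda_e \Big( \sum_{f} t(f\mid e) - 1 \Big) - \sum_{j} \mu_{jml} \Big( \sum_{i} a(i\mid j,m,l) - 1 \Big). \tag{64}$$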



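It is easily verified that Equations (54), (56), and (57)
carry over from Model 1 to Model 2 unchanged. In addition,
an iterative update formula for a(i|j, m, l) can be obtained
by introducing a new count, c(i|j, m, l; f, e), the expected
number of times that the word in position j of f is connected
to the word in position i of e. Clearly,

$$c(i\mid j,m,l;\mathbf{f},\mathbf{e}) = \sum_{\mathbf{a}} P(\mathbf{a}\mid\mathbf{e},\mathbf{f})\, \delta(i,a_j). \tag{65}$$

In analogy with Equations (56) and (57), for a single
translation,

$$a(i\mid j,m,l) = \mu_{jml}^{-1}\, c(i\mid j,m,l;\mathbf{f},\mathbf{e}), \tag{66}$$

and, for a set of translations,

$$a(i\mid j,m,l) = \mu_{jml}^{-1} \sum_{s=1}^{S} c(i\mid j,m,l;\mathbf{f}^{(s)},\mathbf{e}^{(s)}). \tag{67}$$

Notice that if f^(s) does not have the length m or if e^(s) does
not have length l, then the corresponding count is zero. As
with the λ's in earlier equations, the μ's here serve to
normalize the alignment probabilities.

Model 2 shares with Model 1 the important property that
the sums in Equations (55) and (65) can be obtained
efficiently. Equation (63) can be rewritten

$$P(\mathbf{f}\mid\mathbf{e}) = \epsilon \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j\mid e_i)\, a(i\mid j,m,l). \tag{68}$$

Using this form for P(f|e), it follows that

$$c(f\mid e;\mathbf{f},\mathbf{e}) = \sum_{j=1}^{m} \sum_{i=0}^{l} \frac{t(f\mid e)\, a(i\mid j,m,l)\, \delta(f,f_j)\, \delta(e,e_i)}{t(f\mid e_0)\, a(0\mid j,m,l) + \cdots + t(f\mid e_l)\, a(l\mid j,m,l)} \tag{69}$$

and

$$c(i\mid j,m,l;\mathbf{f},\mathbf{e}) = \frac{t(f_j\mid e_i)\, a(i\mid j,m,l)}{t(f_j\mid e_0)\, a(0\mid j,m,l) + \cdots + t(f_j\mid e_l)\, a(l\mid j,m,l)}. \tag{70}$$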



Equation (69) has a double sum rather than the product of
two single sums, as in Equation (60), because, in Equation
(69), i and j are tied together through the alignment
probabilities.

Model 1 is a special case of Model 2 in which a(i|j, m, l)
is held fixed at (l+1)^{-1}. Therefore, any set of parameters for
Model 1 can be reinterpreted as a set of parameters for
Model 2. Taking as initial estimates of the parameters for
Model 2 the parameter values that result from training
Model 1 is equivalent to computing the probabilities of all
alignments using Model 1, but then collecting the counts
appropriate to Model 2. The idea of computing the
probabilities of the alignments using one model, but collecting
the counts in a way appropriate to a second model is very
general and can always be used to transfer a set of
parameters from one model to another.
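
The efficient Model 2 computations can be illustrated with the following Python sketch. It is written under assumptions that are not part of the source: the table layouts t[(f, e)] and a[(i, j)] (for one fixed pair m, l), the function names, and the use of None for the empty notion are all hypothetical. The first function evaluates the product form of Equation (68), the second returns the per-position posteriors of Equation (70), and the third performs the explicit sum over all (l+1)^m alignments of Equation (63) so the two forms can be compared on small inputs.

```python
from itertools import product
from math import isclose, prod


def model2_likelihood(fs, es0, t, a, eps=1.0):
    """Equation (68): eps * prod_j sum_i t(f_j|e_i) a(i|j,m,l).

    es0[0] is the empty notion; a[(i, j)] holds a(i|j,m,l).
    """
    return eps * prod(
        sum(t[(f, e)] * a[(i, j)] for i, e in enumerate(es0))
        for j, f in enumerate(fs, start=1)
    )


def model2_alignment_counts(fs, es0, t, a):
    """Equation (70): expected connection of French position j to English i."""
    c = {}
    for j, f in enumerate(fs, start=1):
        denom = sum(t[(f, e)] * a[(i, j)] for i, e in enumerate(es0))
        for i, e in enumerate(es0):
            c[(i, j)] = t[(f, e)] * a[(i, j)] / denom
    return c


def brute_force_likelihood(fs, es0, t, a, eps=1.0):
    """Equation (63): explicit sum over every alignment (small inputs only)."""
    m = len(fs)
    total = 0.0
    for al in product(range(len(es0)), repeat=m):
        total += prod(t[(fs[j], es0[al[j]])] * a[(al[j], j + 1)]
                      for j in range(m))
    return eps * total
```

Setting every a[(i, j)] to 1/(l+1) reduces these functions to the Model 1 case, which is the parameter transfer described in the preceding paragraph.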

9.6 Intermodel Interlude 

Models 1 and 2 make various approximations to the 
conditional probabilities that appear in Equation (47). 
Although Equation (47) is an exact statement, it is only one 
of many ways in which the joint likelihood of f and a can be 
written as a product of conditional probabilities. Each such 
product corresponds in a natural way to a generative process 
for developing f and a from e. In the process corresponding
to Equation (47), a length for f is chosen first. Next, a
position in e is selected and connected to the first position in
f. Next, the identity of f_1 is selected. Next, another position
in e is selected and this is connected to the second word in
f, and so on.

Casual inspection of some translations quickly establishes
that the is usually translated into a single word (le, la, or l'),
but is sometimes omitted; or that only is often translated into
one word (for example, seulement), but sometimes into two
(for example, ne . . . que), and sometimes into none. The
number of French words to which e is connected in an
alignment is a random variable, Φ_e, that will be called its
fertility. Each choice of the parameters in Model 1 or Model
2 determines a distribution, Pr(Φ_e = φ), for this random
variable. But the relationship is remote: just what change
will be wrought in the distribution of Φ_the if, say, a(1|2, 8,
9) is adjusted, is not immediately clear. In Models 3, 4, and
5, fertilities are parameterized directly.
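
As a small illustration of the definition, the fertility of every English position can be read off an alignment directly. The Python sketch below assumes the tuple representation (a_1, ..., a_m) with entries in 0..l; the function name is hypothetical.

```python
def fertilities(a, l):
    """Return [phi_0, ..., phi_l], where phi_i counts the j with a_j == i."""
    phi = [0] * (l + 1)
    for i in a:
        phi[i] += 1
    return phi
```

For the alignment of FIG. 33 written as a = (2, 3, 4, 5, 6, 6, 6) with l = 6, the word implemented in position 6 has fertility 3 and And in position 1 has fertility 0.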

As a prolegomenon to a detailed discussion of Models 3,
4, and 5, the generative process upon which they are based
will be described. Given an English string, e, a fertility for
each word is selected, and a list of French words to connect
to it is generated. This list, which may be empty, will be
called a tablet. The collection of tablets is a random variable,
T, that will be called the tableau of e; the tablet for the i-th
English word is a random variable, T_i; and the k-th French
word in the i-th tablet is a random variable, T_{ik}. After
choosing the tableau, a permutation of its words is generated,
and f is produced. This permutation is a random
variable, Π. The position in f of the k-th word in the i-th tablet
is yet another random variable, Π_{ik}.

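The joint likelihood for a tableau, τ, and a permutation, π,
is

$$P(\tau,\pi\mid\mathbf{e}) = \prod_{i=1}^{l} P(\phi_i\mid\phi_1^{i-1},\mathbf{e})\; P(\phi_0\mid\phi_1^{l},\mathbf{e}) \times \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} P(\tau_{ik}\mid\tau_{i1}^{k-1},\tau_0^{i-1},\phi_0^{l},\mathbf{e}) \times \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} P(\pi_{ik}\mid\pi_{i1}^{k-1},\pi_1^{i-1},\tau_0^{l},\phi_0^{l},\mathbf{e}) \times \prod_{k=1}^{\phi_0} P(\pi_{0k}\mid\pi_{01}^{k-1},\pi_1^{l},\tau_0^{l},\phi_0^{l},\mathbf{e}). \tag{71}$$

In this equation, τ_{i1}^{k-1} represents the series of values τ_{i1},
. . . , τ_{ik-1}; π_{i1}^{k-1} represents the series of values π_{i1}, . . . ,
π_{ik-1}; and φ_i is shorthand for φ_{e_i}.

Knowing τ and π determines a French string and an
alignment, but in general several different pairs τ, π may
lead to the same pair f, a. The set of such pairs will be
denoted by ⟨f,a⟩. Clearly, then

$$P(\mathbf{f},\mathbf{a}\mid\mathbf{e}) = \sum_{(\tau,\pi)\in\langle\mathbf{f},\mathbf{a}\rangle} P(\tau,\pi\mid\mathbf{e}). \tag{72}$$

The number of elements in ⟨f,a⟩ is

$$\prod_{i=0}^{l} \phi_i!$$

because for each τ_i there are φ_i! arrangements that lead to the
pair f, a. FIG. 36 shows the two tableaux for (bon
marché | cheap(1,2)).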
Except for degenerate cases, there is one alignment in
𝒜(e,f) for which P(a|e,f) is greatest. This alignment will be
called the Viterbi alignment for (f|e) and will be denoted by
V(f|e). For Model 2 (and, thus, also for Model 1), finding
V(f|e) is straightforward. For each j, a_j is chosen so as to
make the product t(f_j|e_{a_j}) a(a_j|j, m, l) as large as possible. The
Viterbi alignment depends on the model with respect to
which it is computed. In what follows, the Viterbi
alignments for the different models will be distinguished by
writing V(f|e;1), V(f|e;2), and so on.

The set of all alignments for which a_j = i will be denoted by
𝒜_{i←j}(e,f). The pair ij will be said to be pegged for the
elements of 𝒜_{i←j}(e,f). The element of 𝒜_{i←j}(e,f) for which
P(a|e,f) is greatest will be called the pegged Viterbi
alignment for ij, and will be denoted by V_{i←j}(f|e). Obviously,
V_{i←j}(f|e;1) and V_{i←j}(f|e;2) can be found quickly with a
straightforward modification of the method described above
for finding V(f|e;1) and V(f|e;2).
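
The position-by-position search described above might be sketched in Python as follows. The function name, the table layouts t[(f, e)] and a[(i, j)], and the optional peg argument are assumptions made for illustration. Because the Model 2 likelihood factors over French positions, each a_j can be chosen independently, and pegging simply fixes one of the choices.

```python
def viterbi_model2(fs, es0, t, a, peg=None):
    """Return the alignment (a_1, ..., a_m) maximizing
    prod_j t(f_j|e_{a_j}) * a(a_j|j,m,l).

    es0[0] is the empty notion. peg=(i, j), if given, forces a_j = i,
    which yields the pegged Viterbi alignment for ij.
    """
    best = []
    for j, f in enumerate(fs, start=1):
        if peg is not None and peg[1] == j:
            best.append(peg[0])
            continue
        # Independent argmax over English positions for this French position
        best.append(max(range(len(es0)),
                        key=lambda i: t[(f, es0[i])] * a[(i, j)]))
    return tuple(best)
```

With a[(i, j)] held constant, the same function gives the Model 1 Viterbi alignment.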

9.7 Model 3 

Model 3 is based on Equation (71). It assumes that, for i
between 1 and l, P(φ_i | φ_1^{i-1}, e) depends only on φ_i and e_i; that,
for all i, P(τ_{ik} | τ_{i1}^{k-1}, τ_0^{i-1}, φ_0^l, e) depends only on τ_{ik} and e_i;
and that, for i between 1 and l, P(π_{ik} | π_{i1}^{k-1}, π_1^{i-1}, τ_0^l, φ_0^l, e)
depends only on π_{ik}, i, m, and l. The parameters of Model 3
are thus a set of fertility probabilities, n(φ|e_i) ≡ P(φ | φ_1^{i-1}, e);
a set of translation probabilities,
t(f|e_i) ≡ Pr(T_{ik} = f | τ_{i1}^{k-1}, τ_0^{i-1}, φ_0^l, e);
and a set of distortion probabilities,
d(j|i, m, l) ≡ Pr(Π_{ik} = j | π_{i1}^{k-1}, π_1^{i-1}, τ_0^l, φ_0^l, e).

The distortion and fertility probabilities for e_0 are handled
differently. The empty notion conventionally occupies
position 0, but actually has no position. Its purpose is to account
for those words in the French string that cannot readily be
accounted for by other notions in the English string. Because
these words are usually spread uniformly throughout the
French string, and because they are placed only after all of
the other words in the sentence have been placed, the
probability Pr(Π_{0k+1} = j | π_{01}^k, π_1^l, τ_0^l, φ_0^l, e) is set to 0 unless
position j is vacant, in which case it is set to (φ_0 − k)^{-1}.
Therefore, the contribution of the distortion probabilities for all of
the words in τ_0 is 1/φ_0!.

The value of φ_0 depends on the length of the French
sentence since longer sentences have more of these
extraneous words. In fact, Model 3 assumes that

$$P(\phi_0\mid\phi_1^{l},\mathbf{e}) = \binom{\phi_1 + \cdots + \phi_l}{\phi_0}\, p_0^{\phi_1 + \cdots + \phi_l - \phi_0}\, p_1^{\phi_0} \tag{73}$$

for some pair of auxiliary parameters p_0 and p_1. The
expression on the left-hand side of this equation depends on φ_1^l
only through the sum φ_1 + . . . + φ_l and defines a probability
distribution over φ_0 whenever p_0 and p_1 are nonnegative and
sum to 1. The probability P(φ_0 | φ_1^l, e) can be interpreted as
follows. Each of the words from τ_1^l is imagined to require
an extraneous word with probability p_1; this word is required
to be connected to the empty notion. The probability that


exactly φ_0 of the words from τ_1^l will require an extraneous
word is just the expression given in Equation (73).

As in Models 1 and 2, an alignment of (f|e) in Model 3 is
determined by specifying a_j for each position in the French
string. The fertilities, φ_0 through φ_l, are functions of the a_j's.
Thus, φ_i is equal to the number of j's for which a_j equals i.
Therefore,

$$P(\mathbf{f}\mid\mathbf{e}) = \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} P(\mathbf{f},\mathbf{a}\mid\mathbf{e}) = \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \binom{m-\phi_0}{\phi_0}\, p_0^{m-2\phi_0}\, p_1^{\phi_0} \prod_{i=1}^{l} \phi_i!\, n(\phi_i\mid e_i) \times \prod_{j=1}^{m} t(f_j\mid e_{a_j})\, d(j\mid a_j,m,l) \tag{74}$$

with Σ_f t(f|e) = 1, Σ_j d(j|i, m, l) = 1, Σ_φ n(φ|e) = 1, and p_0 + p_1 = 1.
According to the assumptions of Model 3, each of the pairs
(τ,π) in ⟨f,a⟩ makes an identical contribution to the sum in
Equation (72). The factorials in Equation (74) come from
carrying out this sum explicitly. There is no factorial for the
empty notion because it is exactly cancelled by the
contribution from the distortion probabilities.

By now, it should be clear how to provide an appropriate
auxiliary function for seeking a constrained maximum of the
likelihood of a translation with Model 3. A suitable auxiliary
function is

$$h(t,d,n,p,\lambda,\mu,\nu,\xi) = P(\mathbf{f}\mid\mathbf{e}) - \sum_{e} \lambda_e \Big( \sum_{f} t(f\mid e) - 1 \Big) - \sum_{i} \mu_{iml} \Big( \sum_{j} d(j\mid i,m,l) - 1 \Big) - \sum_{e} \nu_e \Big( \sum_{\phi} n(\phi\mid e) - 1 \Big) - \xi\,(p_0 + p_1 - 1). \tag{75}$$

As with Models 1 and 2, counts are defined by

$$c(f\mid e;\mathbf{f},\mathbf{e}) = \sum_{\mathbf{a}} P(\mathbf{a}\mid\mathbf{e},\mathbf{f}) \sum_{j=1}^{m} \delta(f,f_j)\, \delta(e,e_{a_j}), \tag{76}$$

$$c(j\mid i,m,l;\mathbf{f},\mathbf{e}) = \sum_{\mathbf{a}} P(\mathbf{a}\mid\mathbf{e},\mathbf{f})\, \delta(i,a_j), \tag{77}$$

$$c(\phi\mid e;\mathbf{f},\mathbf{e}) = \sum_{\mathbf{a}} P(\mathbf{a}\mid\mathbf{e},\mathbf{f}) \sum_{i=1}^{l} \delta(\phi,\phi_i)\, \delta(e,e_i), \tag{78}$$

$$c(0;\mathbf{f},\mathbf{e}) = \sum_{\mathbf{a}} P(\mathbf{a}\mid\mathbf{e},\mathbf{f})\,(m - 2\phi_0), \tag{79}$$

$$c(1;\mathbf{f},\mathbf{e}) = \sum_{\mathbf{a}} P(\mathbf{a}\mid\mathbf{e},\mathbf{f})\, \phi_0. \tag{80}$$

In terms of these counts, the reestimation formulae for
Model 3 are

$$t(f\mid e) = \lambda_e^{-1} \sum_{s=1}^{S} c(f\mid e;\mathbf{f}^{(s)},\mathbf{e}^{(s)}), \tag{81}$$

$$d(j\mid i,m,l) = \mu_{iml}^{-1} \sum_{s=1}^{S} c(j\mid i,m,l;\mathbf{f}^{(s)},\mathbf{e}^{(s)}), \tag{82}$$

$$n(\phi\mid e) = \nu_e^{-1} \sum_{s=1}^{S} c(\phi\mid e;\mathbf{f}^{(s)},\mathbf{e}^{(s)}), \tag{83}$$

and

$$p_k = \xi^{-1} \sum_{s=1}^{S} c(k;\mathbf{f}^{(s)},\mathbf{e}^{(s)}). \tag{84}$$

Equations (76) and (81) are identical to Equations (55)
and (57) and are repeated here only for convenience.
Equations (77) and (82) are similar to Equations (65) and (67), but
a(i|j, m, l) differs from d(j|i, m, l) in that the former sums to
unity over all i for fixed j while the latter sums to unity over
all j for fixed i. Equations (78), (79), (80), (83), and (84), for
the fertility parameters, are new.

The trick that permits the rapid evaluation of the
right-hand sides of Equations (55) and (65) for
Model 2 does not work for Model 3. Because of the fertility
parameters, it is not possible to exchange the sums over a_1
through a_m with the product over j in Equation (74) as was
done for Equations (49) and (63). A method for overcoming
this difficulty now will be described. This method relies on
the fact that some alignments are much more probable than
others. The method is to carry out the sums in Equations (74)
and (76) through (80) only over some of the more probable
alignments, ignoring the vast sea of much less probable
ones.

To define the subset, S, of the elements of 𝒜(e,f) over
which the sums are evaluated a little more notation is
required. Two alignments, a and a', will be said to differ by
a move if there is exactly one value of j for which a_j ≠ a'_j.
Alignments will be said to differ by a swap if a_j = a'_j except
at two values, j_1 and j_2, for which a_{j_1} = a'_{j_2} and a_{j_2} = a'_{j_1}. The
two alignments will be said to be neighbors if they are
identical or differ by a move or by a swap. The set of all
neighbors of a will be denoted by 𝒩(a).

Let b(a) be that neighbor of a for which the likelihood is
greatest. Suppose that ij is pegged for a. Among the
neighbors of a for which ij is also pegged, let b_{i←j}(a) be that for
which the likelihood is greatest. The sequence of alignments
a, b(a), b^2(a) ≡ b(b(a)), . . . , converges in a finite number of
steps to an alignment that will be denoted as b^∞(a).
Similarly, if ij is pegged for a, the sequence of alignments a,
b_{i←j}(a), b^2_{i←j}(a), . . . , converges in a finite number of steps
to an alignment that will be denoted as b^∞_{i←j}(a). The simple
form of the distortion probabilities in Model 3 makes it easy
to find b(a) and b_{i←j}(a). If a' is a neighbor of a obtained from
it by the move of j from i to i', and if neither i nor i' is 0, then

$$P(\mathbf{a}'\mid\mathbf{e},\mathbf{f}) = P(\mathbf{a}\mid\mathbf{e},\mathbf{f})\, \frac{\phi_{i'}+1}{\phi_i}\, \frac{n(\phi_{i'}+1\mid e_{i'})}{n(\phi_{i'}\mid e_{i'})}\, \frac{n(\phi_i-1\mid e_i)}{n(\phi_i\mid e_i)}\, \frac{t(f_j\mid e_{i'})}{t(f_j\mid e_i)}\, \frac{d(j\mid i',m,l)}{d(j\mid i,m,l)}. \tag{85}$$

Notice that φ_{i'} is the fertility of the word in position i' for
alignment a. The fertility of this word in alignment a' is φ_{i'}+1.
Similar equations can be easily derived when either i or i' is
zero, or when a and a' differ by a swap.

Given these preliminaries, S is defined by

$$S = \mathcal{N}\big(b^{\infty}(V(\mathbf{f}\mid\mathbf{e};2))\big) \cup \bigcup_{ij} \mathcal{N}\big(b^{\infty}_{i\leftarrow j}(V_{i\leftarrow j}(\mathbf{f}\mid\mathbf{e};2))\big). \tag{86}$$

In this equation, b^∞(V(f|e;2)) and b^∞_{i←j}(V_{i←j}(f|e;2)) are
approximations to V(f|e;3) and V_{i←j}(f|e;3), neither of which
can be computed efficiently.
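
The neighborhood and the iteration toward b^∞(a) can be sketched in Python as follows. This is an illustration under assumptions: the function names are hypothetical, an alignment is a tuple (a_1, ..., a_m) with entries in 0..l, and score stands for any function returning a likelihood for an alignment (for example, one built from the Model 3 parameters). The sketch rescans the whole neighborhood at each step rather than using the incremental update of Equation (85).

```python
def neighbors(a, l, peg=None):
    """Alignments identical to a or differing from it by one move or one swap.

    If peg=(i, j) is given, the connection a_j = i is never altered.
    """
    m = len(a)
    fixed = None if peg is None else peg[1] - 1   # 0-based index of pegged j
    out = {a}
    for j in range(m):                            # moves: change one a_j
        if j == fixed:
            continue
        for i in range(l + 1):
            if i != a[j]:
                out.add(a[:j] + (i,) + a[j + 1:])
    for j1 in range(m):                           # swaps: exchange a_j1, a_j2
        for j2 in range(j1 + 1, m):
            if fixed in (j1, j2) or a[j1] == a[j2]:
                continue
            b = list(a)
            b[j1], b[j2] = b[j2], b[j1]
            out.add(tuple(b))
    return out


def hill_climb(a, l, score, peg=None):
    """Apply b(.) repeatedly until no neighbor scores strictly higher."""
    while True:
        best = max(neighbors(a, l, peg), key=score)
        if score(best) <= score(a):
            return a
        a = best
```

Starting hill_climb from the Model 2 Viterbi alignment, and from each pegged Viterbi alignment with the corresponding peg, gives the alignments whose neighborhoods are united in Equation (86).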

In one iteration of the EM process for Model 3, the counts
in Equations (76) through (80) are computed by summing
only over elements of S. These counts are then used in
Equations (81) through (84) to obtain a new set of
parameters. If the error made by including only some of the
elements of 𝒜(e,f) is not too great, this iteration will lead to
values of the parameters for which the likelihood of the
training data is at least as large as for the first set of
parameters.
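
The weight that each alignment in S contributes to these sums is the summand of Equation (74). A minimal Python sketch of that summand follows. The function name and the table layouts t[(f, e)], d[(j, i, m, l)], and n[(phi, e)] are assumptions, None stands for the empty notion, and the distortion factor is applied only to words not connected to the empty notion, which is an assumption consistent with the statement that the empty notion's distortion contribution cancels its factorial.

```python
from math import comb, factorial


def model3_joint(fs, es, a, t, d, n, p1):
    """P(f, a | e) under Model 3, following the summand of Equation (74).

    fs, es: French and English word lists (es excludes the empty notion).
    a: tuple (a_1, ..., a_m) with entries in 0..l; 0 is the empty notion.
    p1: probability of an extraneous word; p0 = 1 - p1.
    """
    m, l = len(fs), len(es)
    p0 = 1.0 - p1
    phi = [0] * (l + 1)
    for i in a:
        phi[i] += 1
    if 2 * phi[0] > m:            # binomial coefficient vanishes
        return 0.0
    prob = comb(m - phi[0], phi[0]) * p0 ** (m - 2 * phi[0]) * p1 ** phi[0]
    for i in range(1, l + 1):     # fertility terms with their factorials
        prob *= factorial(phi[i]) * n[(phi[i], es[i - 1])]
    for j, (f, i) in enumerate(zip(fs, a), start=1):
        e = es[i - 1] if i > 0 else None
        prob *= t[(f, e)]
        if i > 0:                 # assumed: no distortion factor for e_0
            prob *= d[(j, i, m, l)]
    return prob
```

Dividing each such value by the total over S gives an approximation to P(a|e,f) restricted to S, which is what the counts of Equations (76) through (80) require.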

The initial estimates of the parameters of Model 3 are
adapted from the final iteration of the EM process for Model
2. That is, the counts in Equations (76) through (80) are
computed using Model 2 to evaluate P(a|e,f). The simple form
of Model 2 again makes exact calculation feasible.
Equations (69) and (70) are readily adapted to compute
counts for the translation and distortion probabilities;
efficient calculation of the fertility counts is more involved. A
discussion of how this is done is given in Section 10.

9.8 Deficiency 

A problem with the parameterization of the distortion
probabilities in Model 3 is this: whereas the sum over all
pairs τ, π of the expression on the right-hand side of Equation
(71) is unity, this is no longer the case if
Pr(Π_{ik} = j | π_{i1}^{k-1}, π_1^{i-1}, τ_0^l, φ_0^l, e) depends only
on j, i, m, and l for i>0. Because the distortion probabilities
for assigning positions to later words do not depend on the
positions assigned to earlier words, Model 3 wastes some of
its probability on what will be called generalized strings, i.e.,
strings that have some positions with several words and
others with none. When a model has this property of not
concentrating all of its probability on events of interest, it
will be said to be deficient. Deficiency is the price for the
simplicity that permits Equation (85).

Deficiency poses no serious problem here. Although
Models 1 and 2 are not technically deficient, they are surely
spiritually deficient. Each assigns the same probability to the
alignments (Je n'ai pas de stylo | I(1) do not(2,4) have(3) a(5)
pen(6)) and (Je pas ai ne de stylo | I(1) do not(2,4) have(3)
a(5) pen(6)), and, therefore, essentially the same probability
to the translations (Je n'ai pas de stylo | I do not have a pen)
and (Je pas ai ne de stylo | I do not have a pen). In each case,
not produces two words, ne and pas, and in each case, one
of these words ends up in the second position of the French
string and the other in the fourth position. The first
translation should be much more probable than the second, but this
defect is of little concern because while the system may be
required to translate the first string someday, it will almost
surely not be required to translate the second. The translation
models are not used to predict French given English but
rather as a component of a system designed to predict
English given French. They need only be accurate to within
a constant factor over well-formed strings of French words.

9.9 Model 4 

Often the words in an English string constitute phrases 
that are translated as units into French. Sometimes, a trans- 
lated phrase may appear at a spot in the French string 
different from that at which the corresponding English 
phrase appears in the English string. The distortion prob- 
abilities of Model 3 do not account well for this tendency of 
phrases to move around as units. Movement of a long phrase 
will be much less likely than movement of a short phrase 
because each word must be moved independently. In Model 
4, the treatment of Pr(Π_{ik} = j | π_{i1}^{k-1}, π_1^{i-1}, τ_0^l, φ_0^l, e) is
modified so as to alleviate this problem. Words that are connected
to the empty notion do not usually form phrases and so
Model 4 continues to assume that these words are spread
uniformly throughout the French string.

As has been described, an alignment resolves an English
string into a notional scheme consisting of a set of possibly
overlapping notions. Each of these notions then accounts for
one or more French words. In Model 3 the notional scheme
for an alignment is determined by the fertilities of the words:
a word is a notion if its fertility is greater than zero. The
empty notion is a part of the notional scheme if φ_0 is greater
than zero. Multi-word notions are excluded. Among the
one-word notions, there is a natural order corresponding to
the order in which they appear in the English string. Let [i]
denote the position in the English string of the i-th one-word
notion. The center of this notion, ⊙_i, is defined to be the
ceiling of the average value of the positions in the French
string of the words from its tablet. The head of the notion is
defined to be that word in its tablet for which the position in
the French string is smallest.
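
The center and head of each one-word notion can be computed directly from an alignment, as in the Python sketch below. The function name and the returned layout are hypothetical; French positions are 1-based, and the head is reported by its position in the French string.

```python
def centers_and_heads(a):
    """Map each English position i > 0 with nonzero fertility to (center, head).

    a: tuple (a_1, ..., a_m); a_j = i connects French position j to English i.
    center: ceiling of the mean French position of the words in the tablet.
    head: smallest French position among the words in the tablet.
    """
    tablets = {}
    for j, i in enumerate(a, start=1):
        if i > 0:                       # the empty notion has no center
            tablets.setdefault(i, []).append(j)
    result = {}
    for i, js in sorted(tablets.items()):
        center = -(-sum(js) // len(js))  # integer ceiling of the average
        result[i] = (center, min(js))
    return result
```

The keys of the returned dictionary, taken in increasing order, are the positions [1], [2], and so on of the one-word notions.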

In Model 4, the probabilities d(j|i, m, l) are replaced by
two sets of parameters: one for placing the head of each
notion, and one for placing any remaining words. For [i]>0,
Model 4 requires that the head for notion i be τ_{[i]1}. It
assumes that

$$P(\Pi_{[i]1} = j \mid \pi_1^{[i]-1}, \tau_0^{l}, \phi_0^{l}, \mathbf{e}) = d_1\big(j - \odot_{i-1} \mid \mathcal{A}(e_{[i-1]}), \mathcal{B}(f_j)\big). \tag{87}$$

Here, 𝒜 and ℬ are functions of the English and French word
that take on a small number of different values as their
arguments range over their respective vocabularies. In the
Section entitled Classes, a process is described for dividing
a vocabulary into classes so as to preserve mutual
information between adjacent classes in running text. The classes
𝒜 and ℬ are constructed as functions with fifty distinct
values by dividing the English and French vocabularies each
into fifty classes according to this method. The probability is
assumed to depend on the previous notion and on the
identity of the French word being placed so as to account for
such facts as the appearance of adjectives before nouns in
English but after them in French. The displacement for the
head of notion i is denoted by j − ⊙_{i−1}. It may be either
positive or negative. The probability d_1(−1 | 𝒜(e), ℬ(f)) is
expected to be larger than d_1(+1 | 𝒜(e), ℬ(f)) when e is an
adjective and f is a noun. Indeed, this is borne out in the
trained distortion probabilities for Model 4, where
d_1(−1 | 𝒜(government's), ℬ(développement)) is 0.9269,
while d_1(+1 | 𝒜(government's), ℬ(développement)) is
0.0022.

The placement of the k-th word of notion i for [i]>0, k>1
is done as follows. It is assumed that

$$P(\Pi_{[i]k} = j \mid \pi_{[i]1}^{k-1}, \pi_1^{[i]-1}, \tau_0^{l}, \phi_0^{l}, \mathbf{e}) = d_{>1}\big(j - \pi_{[i]k-1} \mid \mathcal{B}(f_j)\big). \tag{88}$$

The position π_{[i]k} is required to be greater than π_{[i]k−1}.
Some English words tend to produce a series of French
words that belong together, while others tend to produce a
series of words that should be separate. For example,
implemented can produce mis en application, which usually
occurs as a unit, but not can produce ne pas, which often
occurs with an intervening verb. Thus d_{>1}(2 | ℬ(pas)) is
expected to be relatively large compared with d_{>1}(2 | ℬ(en)).
After training, it is in fact true that d_{>1}(2 | ℬ(pas)) is 0.6936
and d_{>1}(2 | ℬ(en)) is 0.0522.

It is allowed to place τ_{[i]1} either before or after any
previously positioned words, but subsequent words from τ_{[i]}
are required to be placed in order. This does not mean that
they must occupy consecutive positions but only that the
second word from τ_{[i]} must lie to the right of the first, the
third to the right of the second, and so on. Because of this,
at most one of the φ_{[i]}! arrangements of τ_{[i]} is possible.

The count and reestimation formulae for Model 4 are
similar to those for the previous Models and will not be
given explicitly here. The general formulae in Section 10 are
helpful in deriving these formulae. Once again, the several
counts for a translation are expectations of various quantities
over the possible alignments with the probability of each
alignment computed from an earlier estimate of the
parameters. As with Model 3, these expectations are computed by
sampling some small set, S, of alignments. As described
above, the simple form of the distortion probabilities in
Model 3 makes it possible to find b^∞(a) rapidly for any a.
The analogue of Equation (85) for Model 4 is complicated
by the fact that when a French word is moved from notion
to notion, the centers of two notions change, and the
contribution of several words is affected. It is nonetheless
possible to evaluate the adjusted likelihood incrementally,
although it is substantially more time consuming.

Faced with this unpleasant situation, the following
method is employed. Let the neighbors of a be ranked so that
the first is the neighbor for which P(a|e,f;3) is greatest, the
second the one for which P(a|e,f;3) is next greatest, and so
on. Then b̃(a) is the highest ranking neighbor of a for which
P(b̃(a)|e,f;4) is at least as large as P(a|e,f;4). Define b̃_{i←j}(a)
analogously. Here, P(a|e,f;3) means P(a|e,f) as computed
with Model 3, and P(a|e,f;4) means P(a|e,f) as computed
with Model 4. Define S for Model 4 by

$$S = \mathcal{N}\big(\tilde{b}^{\infty}(V(\mathbf{f}\mid\mathbf{e};2))\big) \cup \bigcup_{ij} \mathcal{N}\big(\tilde{b}^{\infty}_{i\leftarrow j}(V_{i\leftarrow j}(\mathbf{f}\mid\mathbf{e};2))\big). \tag{89}$$

This equation is identical to Equation (86) except that b
has been replaced by b̃.
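
The selection of b̃(a) can be sketched in a few lines of Python. The names are hypothetical: candidates stands for the neighbors of a, and score3 and score4 stand for functions returning P(a|e,f;3) and P(a|e,f;4) up to a constant factor. The cheap Model 3 score orders the candidates, and the costlier Model 4 score is evaluated only until an acceptable candidate is found.

```python
def b_tilde(a, candidates, score3, score4):
    """Return the highest-ranked candidate (by Model 3 score) whose Model 4
    score is at least that of a; fall back to a if none qualifies."""
    base = score4(a)
    for b in sorted(candidates, key=score3, reverse=True):
        if score4(b) >= base:
            return b
    return a
```

Because an alignment is its own neighbor, a itself always qualifies when candidates is the full neighborhood, so the fallback matters only if the caller passes a reduced candidate list.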

9.10 Model 5 

Like Model 3 before it, Model 4 is deficient. Not only can
several words lie on top of one another, but words can be
placed before the first position or beyond the last position in
the French string. Model 5 removes this deficiency.



After the words for τ_1^{[i]−1} and τ_{[i]1}^{k−1} are placed, there
will remain some vacant positions in the French string.
Obviously, τ_{[i]k} should be placed in one of these vacancies.
Models 3 and 4 are deficient precisely because they fail to
enforce this constraint for the one-word notions. Let
v(j, τ_1^{[i]−1}, τ_{[i]1}^{k−1}) be the number of vacancies up to and including
position j just before τ_{[i]k} is placed. This number will be
denoted simply as v_j. Two sets of distortion parameters are
retained as in Model 4; they will again be referred to as d_1
and d_{>1}. For [i]>0, it is required that

$$P(\Pi_{[i]1} = j \mid \pi_1^{[i]-1}, \tau_0^{l}, \phi_0^{l}, \mathbf{e}) = d_1\big(v_j \mid \mathcal{B}(f_j), v_{\odot_{i-1}}, v_m - \phi_{[i]} + 1\big)\,\big(1 - \delta(v_j, v_{j-1})\big). \tag{90}$$

The number of vacancies up to j is the same as the number
of vacancies up to j−1 only when j is not, itself, vacant. The
last factor, therefore, is 1 when j is vacant and 0 otherwise.
As with Model 4, d_1 is allowed to depend on the center of
the previous notion and on f_j, but is not allowed to depend
on e_{[i−1]} since doing so would result in too many parameters.

For [i]>0 and k>1, it is assumed that

$$P(\Pi_{[i]k} = j \mid \pi_{[i]1}^{k-1}, \pi_1^{[i]-1}, \tau_0^{l}, \phi_0^{l}, \mathbf{e}) = d_{>1}\big(v_j - v_{\pi_{[i]k-1}} \mid \mathcal{B}(f_j), v_m - v_{\pi_{[i]k-1}} - \phi_{[i]} + k\big)\,\big(1 - \delta(v_j, v_{j-1})\big). \tag{91}$$

Again, the final factor enforces the constraint that τ_{[i]k}
land in a vacant position, and, again, it is assumed that the
probability depends on f_j only through its class.

The details of the count and reestimation formulae are
similar to those for Model 4 and will not be explained
explicitly here. Section 10 contains useful equations for
deriving these formulae. No incremental evaluation of the
likelihood of neighbors is possible with Model 5 because a
move or swap may require wholesale recomputation of the
likelihood of an alignment. Therefore, when the expectations
for Model 5 are evaluated, only the alignments in S as
defined in Equation (89) are included. The set of alignments
included in the sum is further trimmed by removing any
alignment, a, for which P(a|e,f;4) is too much smaller than
P(b^∞(V(f|e;2))|e,f;4).

Model 5 provides a powerful tool for aligning translations.
Its parameters are reliably estimated by making use
of Models 2, 3 and 4. In fact, this is the raison d'etre of these
models. To keep them aware of the lay of the land, their
parameters are adjusted as the iterations of the EM process
for Model 5 are performed. That is, counts for Models 2, 3,
and 4 are computed by summing over alignments as determined
by the abbreviated S described above, using Model 5
to compute P(a|e,f). Although this appears to increase the
storage necessary for maintaining counts as the training is
processed, the extra burden is small because the overwhelming
majority of the storage is devoted to counts for t(f|e), and
these are the same for Models 2, 3, 4, and 5.

9.11 Results 

A large collection of training data is used to estimate the
parameters of the five models described above. In one
embodiment of these models, training data is obtained using
the method described in detail in the paper, Aligning Sentences
in Parallel Corpora, by P. F. Brown, J. C. Lai, and R.
L. Mercer, appearing in the Proceedings of the 29th Annual
Meeting of the Association for Computational Linguistics,
June 1991. This paper is incorporated by reference herein.
This method is applied to a large number of translations
from several years of the proceedings of the Canadian
parliament. From these translations, a training data set is
chosen comprising those pairs for which both the English
sentence and the French sentence are thirty words or less in
length. This is a collection of 1,778,620 translations. In an






effort to eliminate some of the typographical errors that
abound in the text, an English vocabulary is chosen consisting
of all of those words that appear at least twice in English
sentences in the data, and a French vocabulary is chosen
consisting of all those words that appear at least twice in
French sentences in the data. All other words are replaced
with a special unknown English word or unknown French
word according as they appear in an English sentence or a
French sentence. In this way an English vocabulary of
42,005 words and a French vocabulary of 58,016 words is
obtained. Some typographic errors are quite frequent, for
example, momento for memento, and so the vocabularies are
not completely free of them. At the same time, some words
are truly rare, and in some cases, legitimate words are
omitted. Adding e_0 to the English vocabulary brings it to
42,006 words.
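The vocabulary selection just described can be sketched as follows. This is a minimal illustration under stated assumptions: the marker `<unk>`, the function names, and the three toy sentences are hypothetical, and the sketch handles one language at a time.

```python
from collections import Counter

UNKNOWN = "<unk>"  # hypothetical marker for the special unknown word


def build_vocabulary(sentences, min_count=2):
    """Keep every word that appears at least min_count times in the data."""
    counts = Counter(word for sentence in sentences for word in sentence)
    return {word for word, n in counts.items() if n >= min_count}


def replace_rare(sentences, vocabulary):
    """Replace each out-of-vocabulary word with the unknown-word marker."""
    return [[w if w in vocabulary else UNKNOWN for w in s] for s in sentences]


english = [
    ["the", "house", "is", "big"],
    ["the", "house", "is", "old"],
    ["a", "memento"],
]
vocab = build_vocabulary(english)
cleaned = replace_rare(english, vocab)
```

A frequent misspelling survives this filter while a legitimate word seen only once does not, which is the trade-off noted in the text.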

Eleven iterations of the EM process are performed for this
data. The process is initialized by setting each of the
2,437,020,096 translation probabilities, t(f|e), to 1/58016.
That is, each of the 58,016 words in the French vocabulary
is assumed to be equally likely as a translation for each of
the 42,006 words in the English vocabulary. For t(f|e) to be
greater than zero at the maximum likelihood solution for one
of the models, f and e must occur together in at least one of
the translations in the training data. This is the case for only
25,427,016 pairs, or about one percent of all translation
probabilities. On the average, then, each English word
appears with about 605 French words.
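The uniform initialization and the restriction to co-occurring pairs can be seen in a toy EM sketch for Model 1. This is an illustrative sketch only, not the training procedure of the embodiment: the token `<e0>` standing in for the empty notion, the function name, and the three sentence pairs are assumptions made for the example.

```python
from collections import defaultdict

NULL = "<e0>"  # hypothetical token standing in for the empty notion e_0


def train_model1(pairs, iterations=5):
    """EM for t(f|e) under Model 1 on (english_words, french_words) pairs."""
    french_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(french_vocab)
    # Only co-occurring pairs are stored; every other t(f|e) stays at zero.
    t = {(f, e): uniform
         for es, fs in pairs for e in [NULL] + es for f in fs}
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in pairs:
            candidates = [NULL] + es
            for f in fs:
                norm = sum(t[(f, e)] for e in candidates)
                for e in candidates:
                    posterior = t[(f, e)] / norm
                    count[(f, e)] += posterior
                    total[e] += posterior
        # Renormalize so that t(.|e) sums to one for each English word.
        t = {(f, e): c / total[e] for (f, e), c in count.items()}
    return t


pairs = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
    (["a", "flower"], ["une", "fleur"]),
]
t = train_model1(pairs, iterations=2)

# The table sizes quoted in the text follow from the vocabulary sizes.
total_parameters = 42_006 * 58_016
```

Even after two iterations the sketch prefers maison over la as a translation of house, because la is better explained by the.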

Table 6 summarizes the training computation. At each
iteration, the probabilities of the various alignments of each
translation using one model are computed, and the counts
using a second, possibly different model are accumulated.
These are referred to in the table as the In model and the Out
model, respectively. After each iteration, individual values
are retained for only those translation probabilities that
surpass a threshold; the remainder are set to the small value
10^{-12}. This value is so small that it does not affect the
normalization conditions, but is large enough that translation
probabilities can be resurrected during later

TABLE 6

Iteration   In -> Out   Survivors     Alignments   Perplexity
1           1 -> 2      12,017,609                 71550.56
2           2 -> 2      12,160,475                 201.99
3           2 -> 2       9,403,220                 89.41
4           2 -> 2       6,837,172                 61.59
5           2 -> 2       5,303,312                 49.77
6           2 -> 2       4,397,172                 46.36
7           2 -> 3       3,841,470                 45.15
8           3 -> 5       2,057,033    291          124.28
9           5 -> 5       1,850,665    95           39.17
10          5 -> 5       1,763,665    48           32.91
11          5 -> 5       1,703,393    39           31.29
12          5 -> 5       1,658,364    33           30.65

iterations. As is apparent from columns 4 and 5, even though 
the threshold is lowered as iterations progress, fewer and 
fewer probabilities survive. By the final iteration, only 
1,620,287 probabilities survive, an average of about thirty- 
nine French words for each English word. 
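The thresholding step can be sketched as below. The floor value is the one named in the text; the function names, the threshold, and the toy probabilities are hypothetical.

```python
FLOOR = 1e-12  # the small value to which non-surviving probabilities are set


def prune(t, threshold):
    """Keep values that surpass the threshold; set the remainder to FLOOR."""
    return {pair: (p if p > threshold else FLOOR) for pair, p in t.items()}


def survivors(t):
    """Count the translation probabilities that were retained individually."""
    return sum(1 for p in t.values() if p > FLOOR)


t = {("le", "the"): 0.60, ("la", "the"): 0.39, ("chien", "the"): 0.01}
pruned = prune(t, threshold=0.05)
kept = survivors(pruned)
```

Because a pruned entry keeps a tiny positive value rather than zero, a later iteration can still raise it above the threshold.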

As has been described, when the In model is neither
Model 1 nor Model 2, the counts are computed by summing
over only some of the possible alignments. Many of these
alignments have a probability much smaller than that of the
Viterbi alignment. The column headed Alignments in Table
6 shows the average number of alignments for which the
probability is within a factor of 25 of the probability of the
Viterbi alignment in each iteration. As this number drops,
the model concentrates more and more probability onto
fewer and fewer alignments so that the Viterbi alignment
becomes ever more dominant.

The last column in the table shows the perplexity of the
French text given the English text for the In model of the
iteration. The likelihood of the training data is expected to
increase with each iteration. This likelihood can be thought
of as arising from a product of factors, one for each French
word in the training data. There are 28,850,104 French
words in the training data so the 28,850,104th root of the
likelihood is the average factor by which the likelihood is
reduced for each additional French word. The reciprocal of
this root is the perplexity shown in the table. As the
likelihood increases, the perplexity decreases. A steady
decrease in perplexity is observed as the iterations progress
except when a switch from Model 2 as the In model to
Model 3 is made. This sudden jump is not because Model 3
is a poorer model than Model 2, but because Model 3 is
deficient; the great majority of its probability is squandered
on objects that are not strings of French words. As has been
explained, deficiency is not a problem. In the description of
Model 1, P(m|e) was left unspecified. In quoting perplexities
for Models 1 and 2, it is assumed that the length of
the French string is Poisson with a mean that is a linear
function of the length of the English string. Specifically, it is
assumed that Pr(M=m|e) = (λl)^m e^{-λl}/m!, with λ equal to 1.09.
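The relation between likelihood and perplexity, and the Poisson length assumption, can be written out as a short sketch. The function names are hypothetical; working with the logarithm of the likelihood is an assumption made here to avoid numerical underflow.

```python
import math


def perplexity(log_likelihood, num_words):
    """Reciprocal of the num_words-th root of the likelihood."""
    return math.exp(-log_likelihood / num_words)


def poisson_length_prob(m, l, lam=1.09):
    """Pr(M = m | e) for an English string of length l: Poisson with mean lam * l."""
    mean = lam * l
    return math.exp(-mean) * mean ** m / math.factorial(m)


# Four French words, each contributing a factor of 1/2 to the likelihood.
toy_perplexity = perplexity(4 * math.log(0.5), 4)

# The length probabilities for a ten-word English string sum to one.
total_mass = sum(poisson_length_prob(m, 10) for m in range(100))
```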

It is interesting to see how the Viterbi alignments change
as the iterations progress. In Table 7, the Viterbi alignments
for several sentences after iterations 1, 6, 7, and 11 are
displayed. These are the final iterations for Models 1, 2, 3,
and 5, respectively. In each example, a subscript is affixed
to each word to help in interpreting the list of numbers after
each English word. In the first example, (Il me semble faire
signe que oui | It seems to me that he is nodding), two
interesting changes evolve over the course of the iterations.
In the alignment for Model 1, Il is correctly connected to he,
but in all later alignments Il is incorrectly connected to It.
Models 2, 3, and 5 discount a connection of he to Il because
it is quite far away. None of the five models is sophisticated
enough to make this connection properly. On the other hand,
while nodding is connected only to signe and oui by Models
1, 2, and 3, it is correctly connected to the entire phrase
faire signe que oui by Model 5. In the second example,
(Voyez les profits que ils ont réalisés | Look at the profits they
have made), Models 1, 2, and 3 incorrectly connect profits_4
to both profits_3 and réalisés_7, but with Model 5, profits_4 is
correctly connected to profits_3 and made_7 is connected to
réalisés_7. Finally, in (De les promesses, de les
promesses! | Promises, promises.), Promises_1 is connected to
both instances of promesses with Model 1; promises_3 is
connected to most of the French sentence with Model 2; the
final punctuation of the English sentence is connected to
both the exclamation point and, curiously, to de_5 with Model
3. Only Model 5 produces a satisfying alignment of the two
sentences. The orthography for the French sentence in the
second example is Voyez les profits qu'ils ont réalisés and in
the third example is Des promesses, des promesses! Notice
that the e has been restored to the end of qu' and des has
twice been analyzed into its constituents, de and les. These
and other petty pseudographic improprieties are committed
in the interest of regularizing the French text. In all cases,
orthographic French can be recovered by rule from the
corrupted versions.






Tables 8 through 17 show the translation probabilities and
fertilities after the final iteration of training for a number of
English words. All and only those probabilities that are
greater than 0.01 are shown. Some words, like nodding, in
Table 8, do not slip gracefully into French. Thus, there are
translations like (Il fait signe que oui | He is nodding), (Il fait
un signe de la tête | He is nodding), (Il fait un signe de tête
affirmatif | He is nodding), or (Il hoche la tête
affirmativement | He is nodding). As a result, nodding frequently
has a large fertility and spreads its translation
probability over a variety of words. In French, what is worth
saying is worth saying in many different ways. This is also
seen with words like should, in Table 9, which rarely has a
fertility greater than one but still produces many different
words, among them devrait, devraient, devrions, doit,
doivent, devons, and devrais. These are (just a fraction of the
many) forms of the French verb devoir. Adjectives fare a
little better: national, in Table 10, almost never produces
more than one word and confines itself to one of nationale,
national, nationaux, and nationales, respectively the feminine,
the masculine, the masculine plural, and the feminine
plural of the corresponding French adjective. It is clear that
the models would benefit from some kind of morphological
processing to rein in the lexical exuberance of French.

As seen from Table 11, the produces le, la, les, and l' as
is expected. Its fertility is usually 1, but in some situations
English prefers an article where French does not and so
about 14% of the time its fertility is 0. Sometimes, as with
farmers, in Table 12, it is French that prefers the article.
When this happens, the English noun trains to produce its
translation

TABLE 7

The progress of alignments with iteration.

Il_1 me_2 semble_3 faire_4 signe_5 que_6 oui_7

It seems(3) to(4) me(2) that(6) he(1) is nodding(5,7)
It(1) seems(3) to me(2) that he is nodding(5,7)
It(1) seems(3) to(4) me(2) that(6) he is nodding(5,7)
It(1) seems(3) to me(2) that he is nodding(4,5,6,7)

Voyez_1 les_2 profits_3 que_4 ils_5 ont_6 réalisés_7

Look(1) at the(2) profits(3,7) they(5) have(6) made
Look(1) at the(2,4) profits(3,7) they(5) have(6) made
Look(1) at the profits(3,7) they(5) have(6) made
Look(1) at the(2) profits(3) they(5) have(6) made(7)

De_1 les_2 promesses_3 ,_4 de_5 les_6 promesses_7 !_8

Promises(3,7) ,(4) promises .(8)
Promises ,(4) promises(2,3,6,7) .(8)
Promises(3) ,(4) promises(7) .(5,8)
Promises(2,3) ,(4) promises(6,7) .(8)






TABLE 8

Translation and fertility probabilities for nodding.

f           t(f|e)    φ    n(φ|e)
signe       0.164          0.342
la          0.123          0.293
tête        0.097          0.167
oui         0.086          0.163
fait        0.073          0.023
que         0.073
hoche       0.054
hocher      0.048
faire       0.030
me          0.024
approuve    0.019
qui         0.019
un          0.012
faites      0.011




TABLE 9

Translation and fertility probabilities for should.

f           t(f|e)    φ    n(φ|e)
devrait     0.330     1    0.649
devraient   0.123     0    0.336
devrions    0.109     2    0.014
faudrait    0.073
faut        0.058
doit        0.058
aurait      0.041
doivent     0.024
devons      0.017
devrais     0.013




TABLE 10

Translation and fertility probabilities for national.

f           t(f|e)    φ    n(φ|e)
nationale   0.469     1    0.905
national    0.418     0    0.094
nationaux   0.054
nationales  0.029




TABLE 11

Translation and fertility probabilities for the.

f           t(f|e)    φ    n(φ|e)
le          0.497     1    0.746
la          0.207     0    0.254
les         0.155
l'          0.086
ce          0.018
cette       0.011




TABLE 12

Translation and fertility probabilities for farmers.

f             t(f|e)    φ    n(φ|e)
agriculteurs  0.442     2    0.731
les           0.418     1    0.228
cultivateurs  0.046     0    0.039
producteurs   0.021





TABLE 13

Translation and fertility probabilities for external.

f             t(f|e)    φ    n(φ|e)
extérieures   0.944     1    0.967
extérieur     0.015     0    0.028
externe       0.011
extérieurs    0.010


TABLE 14

Translation and fertility probabilities for answer.

f           t(f|e)    φ    n(φ|e)
réponse     0.442     1    0.809
répondre    0.223     2    0.115
répondu     0.041     0    0.074
à           0.038
solution    0.027
répondez    0.021
répondrai   0.016
réponde     0.014
y           0.013
ma          0.010






TABLE 15

Translation and fertility probabilities for oil.

f            t(f|e)    φ    n(φ|e)
pétrole      0.558     1    0.760
pétrolières  0.138     0    0.181
pétrolière   0.109     2    0.057
le           0.054
pétrolier    0.030
pétroliers   0.024
huile        0.020
Oil          0.013


TABLE 16

Translation and fertility probabilities for former.

f           t(f|e)    φ    n(φ|e)
ancien      0.592     1    0.866
anciens     0.092     0    0.074
ex          0.092     2    0.060
précédent   0.054
l'          0.043
ancienne    0.018
été         0.013









fertility 2 and usually produces either agriculteurs or les.
Additional examples are included in Tables 13 through 17,
which show the translation and fertility probabilities for
external, answer, oil, former, and not.

FIGS. 37, 38, and 39 show automatically derived alignments
for three translations. In the terminology used above,
each alignment is b^∞(V(f|e;2)). It should be understood that
these alignments have been found by a process that involves
no explicit knowledge of either French or English. Every
fact adduced to support them has been discovered automatically
from the 1,778,620 translations that constitute the
training data. This data, in turn, is the product of a process
the sole linguistic input of which is a set of rules explaining
how to find sentence boundaries in the two languages. It may
justifiably be argued, therefore, that these alignments are
inherent in the Canadian Hansard data itself.

In the alignment shown in FIG. 37, all but one of the
English words has fertility 1. The final prepositional phrase
has been moved to the front of the French sentence, but
otherwise the translation is almost verbatim. Notice, however,
that the new proposal has been translated into les
nouvelles propositions, demonstrating that number is not an
invariant under translation. The empty notion has fertility 5
here. It generates en_1, de_3, the comma, de_16, and de_18.

In FIG. 38, two of the English words have fertility 0, one 
has fertility 2, and one, embattled, has fertility 5. Embattled 
is another word, like nodding, that eludes the French grasp 
and comes with a panoply of multi-word translations. 

The final example, in FIG. 39, has several features that
bear comment. The second word, Speaker, is connected to
the sequence l'Orateur. Like farmers above, it has trained to
produce both the word that one naturally thinks of as its
translation and the article in front of it. In the Hansard data,
Speaker always has fertility 2 and produces equally often
l'Orateur and le président. Later in the sentence, starred is
connected to the phrase marquées de un astérisque. From an
initial situation in which each French word is equally
probable as a translation of starred, the system has arrived,
through training, at a situation where it is able to connect to
just the right string of four words. Near the end of the
sentence, give is connected to donnerai, the first person
singular future of donner, which means to give. It might be
better if both will and give were connected to donnerai, but
by limiting notions to no more than one word, Model 5
precludes this possibility. Finally, the last twelve words of
the English sentence, I now have the answer and will give
it to the House, clearly correspond to the last seven words of
the French sentence, je donnerai la réponse à la Chambre,
but literally, the French is I will give the answer to the
House. There is nothing about now, have, and, or it, and each
of these words has fertility 0. Translations that are as far as
this from the literal are rather more the rule than the
exception in the training data.



TABLE 17

Translation and fertility probabilities for not.

f           t(f|e)    φ    n(φ|e)
ne          0.497     2    0.735
pas         0.442     0    0.154
non         0.029     1    0.107
rien        0.011



together with an article. Thus, farmers typically has a 



9.12 Better Translation Models 

Models 1 through 5 provide an effective means for
obtaining word-by-word alignments of translations, but as a
means to achieve the real goal of translation, there is room
for improvement.

Two shortcomings are immediately evident. Firstly, 
because they ignore multi-word notions, these models are 
forced to give a false, or at least an unsatisfactory, account 
of some features in many translations. Also, these models 
are deficient either in fact as with Models 3 and 4, or in 
spirit as with Models 1, 2, and 5. 



9.12.1 Deficiency



It has been argued above that neither spiritual nor actual
deficiency poses a serious problem, but this is not entirely
true. Let w(e) be the sum of P(f|e) over well-formed French
strings and let i(e) be the sum over ill-formed French strings.
In a deficient model, w(e)+i(e)<1. In this case, the remainder
of the probability is concentrated on the event failure and so
w(e)+i(e)+P(failure|e)=1. Clearly, a model is deficient precisely
when P(failure|e)>0. If P(failure|e)=0, but i(e)>0, then
the model is spiritually deficient. If w(e) were independent
of e, neither form of deficiency would pose a problem, but
because Models 1-5 have no long-term constraints, w(e)
decreases exponentially with l. When computing alignments,
even this creates no problem because e and f are
known. However, for a given f, if the goal is to discover the
most probable ê, then the product P(e)P(f|e) is too small for
long English strings as compared with short ones. As a
result, short English strings are improperly favored over
longer English strings. This tendency is counteracted in part
by the following modification:

•Replace P(f|e) with c^l P(f|e) for some empirically chosen
constant c.

This modification is treatment of the symptom rather than
treatment of the disease itself, but it offers some temporary
relief. The cure lies in better modelling.
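The effect of the length correction can be illustrated with a small sketch. The scores and candidate strings below are invented for the example, and the function names are hypothetical; only the form c^l P(f|e) is taken from the text.

```python
import math


def corrected_score(log_p_e, log_p_f_given_e, length, c):
    """Logarithm of P(e) * c**l * P(f|e), where l is the length of the English string."""
    return log_p_e + length * math.log(c) + log_p_f_given_e


def best_candidate(candidates, c):
    """candidates: list of (english_words, log P(e), log P(f|e)) with hypothetical scores."""
    return max(
        candidates,
        key=lambda x: corrected_score(x[1], x[2], len(x[0]), c),
    )[0]


candidates = [
    (["yes"], -2.0, -6.0),
    (["he", "is", "nodding"], -5.0, -4.0),
]
uncorrected = best_candidate(candidates, c=1.0)
corrected = best_candidate(candidates, c=2.0)
```

With c equal to 1 the shorter string wins; a constant greater than 1 offsets the exponential penalty on longer strings.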

9.12.2 Multi-word Notions

Models 1 through 5 all assign non-zero probability only to
alignments with notions containing no more than one word
each. Except in Models 4 and 5, the concept of a notion plays
little role in the development. Even in these models, notions
are determined implicitly by the fertilities of the words in the
alignment: words for which the fertility is greater than zero
make up one-word notions; those for which it is zero do not.
It is not hard to give a method for extending the generative
process upon which Models 3, 4, and 5 are based to
encompass multi-word notions. This method comprises the
following enhancements:

•Including a step for selecting the notional scheme;

•Ascribing fertilities to notions rather than to words;

•Requiring that the fertility of each notion be greater
than zero;

•In Equation (71), replacing the products over words in
an English string with products over notions in the
notional scheme.

Extension beyond one-word notions, however, requires
some care. In an embodiment as described in the subsection
Results, an English string can contain any of 42,005 one-word
notions, but there are more than 1.7 billion possible
two-word notions, more than 74 trillion three-word notions,
and so on. Clearly, it is necessary to be discriminating in
choosing potential multi-word notions. The caution displayed
by Models 1-5 in limiting consideration to notions
with fewer than two words is motivated primarily by respect
for the featureless desert that multi-word notions offer a
priori. The Viterbi alignments computed with Model 5 give
a frame of reference from which to develop methods that
expand horizons to multi-word notions. These methods
involve:

•Promoting a multi-word sequence to notionhood when
its translations as a unit, as observed in the Viterbi
alignments, differ substantially from what is expected
on the basis of its component parts.

In English, for example, either a boat or a person can be
left high and dry, but in French, un bateau is not left haut et
sec, nor une personne haute et sèche. Rather, a boat is left
échoué and a person en plan. High and dry, therefore, is a
promising three-word notion because its translation is not
compositional.

10 MATHEMATICAL SUMMARY OF
TRANSLATION MODELS



10.1 Summary of Notation

\mathcal{E}              English vocabulary
e                        English word
\mathbf{e}               English sentence
l                        length of e
i                        position in e, i = 0, 1, ..., l
e_i                      word i of e
e_0                      the empty notion
e_1^i                    e_1 e_2 ... e_i
\mathcal{F}              French vocabulary
f                        French word
\mathbf{f}               French sentence
m                        length of f
j                        position in f, j = 1, 2, ..., m
f_j                      word j of f
a_j                      position in e connected to position j of f for alignment a
a_1^j                    a_1 a_2 ... a_j
φ_i                      number of positions of f connected to position i of e
τ                        tableau - a sequence of tablets, where a tablet is a sequence of French words
τ_i                      tablet i of τ
τ_0^i                    τ_0 τ_1 ... τ_i
φ_i                      length of τ_i
k                        position within a tablet, k = 1, 2, ..., φ_i
τ_ik                     word k of τ_i
π                        a permutation of the positions of a tableau
π_ik                     position in f for word k of τ_i for permutation π
π_{i1}^k                 π_i1 π_i2 ... π_ik
V(f|e)                   Viterbi alignment for (f|e)
V_{i←j}(f|e)             Viterbi alignment for (f|e) with ij pegged
N(a)                     neighboring alignments of a
N_ij(a)                  neighboring alignments of a with ij pegged
b(a)                     alignment in N(a) with greatest probability
b^∞(a)                   alignment obtained by applying b repeatedly to a
b_{i←j}(a)               alignment in N_ij(a) with greatest probability
b^∞_{i←j}(a)             alignment obtained by applying b_{i←j} repeatedly to a
A(e)                     class of English word e
B(f)                     class of French word f
Δj                       displacement of a word in f
v, v'                    vacancies in f
ρ_i                      first position in e to the left of i that has non-zero fertility
c_i                      average position in f of the words connected to position i of e
[i]                      position in e of the i-th one-word notion
P_θ                      translation model P with parameter values θ
C(f, e)                  empirical distribution of a sample
ψ(P_θ)                   log-likelihood objective function
R(P_θ̃, P_θ)             relative objective function
t(f|e)                   translation probabilities (All Models)
ε(m|l)                   sentence length probabilities (Models 1 and 2)
n(φ|e)                   fertility probabilities (Models 3, 4, and 5)
p_0, p_1                 fertility probabilities for e_0 (Models 3, 4, and 5)
a(i|j, l, m)             alignment probabilities (Model 2)
d(j|i, l, m)             distortion probabilities (Model 3)
d_1(Δj|A, B)             distortion probabilities for the first word of a tablet (Model 4)
d_{>1}(Δj|B)             distortion probabilities for the other words of a tablet (Model 4)
d_LR(l_or_r|B, v, v')    distortion probabilities for choosing left or right movement (Model 5)
d_L(Δj|B, v)             distortion probabilities for leftward movement of the first word of a tablet (Model 5)
d_R(Δj|B, v)             distortion probabilities for rightward movement of the first word of a tablet (Model 5)
d_{>1}(Δj|B, v)          distortion probabilities for movement of the other words of a tablet (Model 5)



10.2 Model 1

Parameters

ε(m|l)    string length probabilities
t(f|e)    translation probabilities

Here f ∈ F; e ∈ E or e = e_0; l = 1, 2, ...; and m = 1, 2, ....

General Formula

P_\theta(f, a | e) = P_\theta(m | e) P_\theta(a | m, e) P_\theta(f | a, m, e)    (92)

Assumptions

P_\theta(m | e) = \epsilon(m | l)    (93)

P_\theta(a | m, e) = (l + 1)^{-m}    (94)

P_\theta(f | a, m, e) = \prod_{j=1}^{m} t(f_j | e_{a_j})    (95)

This model is not deficient.

Generation

Equations (92)-(95) describe the following process for
producing f from e:

1. Choose a length m for f according to the probability
distribution ε(m|l).

2. For each j = 1, 2, ..., m, choose a_j from 0, 1, 2, ..., l
according to the uniform distribution.

3. For each j = 1, 2, ..., m, choose a French word f_j
according to the distribution t(f_j|e_{a_j}).

Useful Formulae

Because of the independence assumptions (93)-(95), the
sum over alignments (26) can be calculated in closed form:

P_\theta(f | e) = \sum_{a} P_\theta(f, a | e)    (96)

= \epsilon(m | l) (l + 1)^{-m} \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j | e_{a_j})    (97)

= \epsilon(m | l) (l + 1)^{-m} \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j | e_i).    (98)

Equation (98) is useful in computations since it involves
only O(lm) arithmetic operations, whereas the original sum
over alignments (97) involves O(l^m) operations.

Concavity

The objective function (25) for this model is a strictly
concave function of the parameters. In fact, from Equations
(25) and (98),

\psi(P_\theta) = \sum_{f,e} C(f, e) \sum_{j=1}^{m} \log \sum_{i=0}^{l} t(f_j | e_i) + \sum_{f,e} C(f, e) \log \epsilon(m | l) + constant    (99)

which is clearly concave in ε(m|l) and t(f|e) since the
logarithm of a sum is concave, and the sum of concave
functions is concave.

Because ψ is concave, it has a unique local maximum.
Moreover, this maximum can be found using the method of
FIG. 28, provided that none of the initial parameter values
is zero.

10.3 Model 2

Parameters

ε(m|l)         string length probabilities
t(f|e)         translation probabilities
a(i|j, l, m)   alignment probabilities

Here i = 0, ..., l; and j = 1, ..., m.

General Formula

P_\theta(f, a | e) = P_\theta(m | e) P_\theta(a | m, e) P_\theta(f | a, m, e)    (100)

Assumptions

P_\theta(m | e) = \epsilon(m | l)    (101)

P_\theta(a | m, e) = \prod_{j=1}^{m} a(a_j | j, l, m)    (102)

P_\theta(f | a, m, e) = \prod_{j=1}^{m} t(f_j | e_{a_j})    (103)

This model is not deficient. Model 1 is the special case of
this model in which the alignment probabilities are uniform:
a(i|j, l, m) = (l+1)^{-1} for all i.

Generation

Equations (100)-(103) describe the following process for
producing f from e:

1. Choose a length m for f according to the distribution
ε(m|l).

2. For each j = 1, 2, ..., m, choose a_j from 0, 1, 2, ..., l
according to the distribution a(a_j|j, l, m).

3. For each j, choose a French word f_j according to the
distribution t(f_j|e_{a_j}).

Useful Formulae

Just as for Model 1, the independence assumptions allow
us to calculate the sum over alignments (26) in closed form:

P_\theta(f | e) = \sum_{a} P_\theta(f, a | e)    (104)

= \epsilon(m | l) \sum_{a_1=0}^{l} \cdots \sum_{a_m=0}^{l} \prod_{j=1}^{m} t(f_j | e_{a_j}) a(a_j | j, l, m)    (105)

= \epsilon(m | l) \prod_{j=1}^{m} \sum_{i=0}^{l} t(f_j | e_i) a(i | j, l, m).    (106)

By assumption (102) the connections of a are independent
given the length m of f. From Equation (106) it follows that
they are also independent given f:

P_\theta(a | f, e) = \prod_{j=1}^{m} p_\theta(a_j | j, f, e),    (107)

where

p_\theta(i | j, f, e) = \frac{\alpha(i, j, f, e)}{\sum_{i'} \alpha(i', j, f, e)}    (108)

with

\alpha(i, j, f, e) = t(f_j | e_i) a(i | j, l, m).

Viterbi Alignment

For this model, and thus also for Model 1, the Viterbi
alignment V(f|e) between a pair of strings (f, e) can be
expressed in closed form:

V(f | e)_j = \arg\max_{i} t(f_j | e_i) a(i | j, l, m).    (109)

Parameter Reestimation Formulae

The parameter values θ̃ that maximize the relative objective
function R(P_θ̃, P_θ) can be found by applying the
considerations of subsection 8.3.4. The counts c(ω; a, f, e)
of Equation (38) are

c(f | e; a, f, e) = \sum_{j=1}^{m} \delta(e, e_{a_j}) \delta(f, f_j),    (110)

c(i | j, l, m; a, f, e) = \delta(i, a_j).    (111)

The parameter reestimation formulae for t(f|e) and a(i|j, l,
m) are obtained by using these counts in Equations
(42)-(46).

Equation (44) requires a sum over alignments. If P_θ
satisfies

P_\theta(a | f, e) = \prod_{j=1}^{m} p_\theta(a_j | j, f, e),    (112)

as is the case for Models 1 and 2 (see Equation (107)), then
this sum can be calculated explicitly:

\bar{c}_\theta(f | e; f, e) = \sum_{i=0}^{l} \sum_{j=1}^{m} p_\theta(i | j, f, e) \delta(e, e_i) \delta(f, f_j),    (113)

\bar{c}_\theta(i | j; f, e) = p_\theta(i | j, f, e).    (114)

Equations (110)-(114) involve only O(lm) arithmetic
operations, whereas the sum over alignments involves O(l^m)
operations.

10.4 Model 3

Parameters

t(f|e)         translation probabilities
n(φ|e)         fertility probabilities
p_0, p_1       fertility probabilities for e_0
d(j|i, l, m)   distortion probabilities

Here φ = 0, 1, 2, ....

General Formulae

P_\theta(\tau, \pi | e) = P_\theta(\phi | e) P_\theta(\tau | \phi, e) P_\theta(\pi | \tau, \phi, e)    (115)

P_\theta(f, a | e) = \sum_{(\tau, \pi) \in \langle f, a \rangle} P_\theta(\tau, \pi | e)    (116)

Here <f, a> is the set of all (τ, π) consistent with (f, a):

(τ, π) ∈ <f, a> if for all i = 0, 1, ..., l and k = 1, 2, ..., φ_i, f_{π_ik} = τ_ik and a_{π_ik} = i.    (117)

Assumptions

P_\theta(\phi | e) = n_0(\phi_0 | \sum_{i=1}^{l} \phi_i) \prod_{i=1}^{l} n(\phi_i | e_i)    (118)

P_\theta(\tau | \phi, e) = \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} t(\tau_{ik} | e_i)    (119)

P_\theta(\pi | \tau, \phi, e) = \frac{1}{\phi_0!} \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} d(\pi_{ik} | i, l, m)    (120)

where

n_0(\phi_0 | m') = \binom{m'}{\phi_0} p_0^{m' - \phi_0} p_1^{\phi_0}.    (121)

In Equation (120) the factor of 1/φ_0! accounts for the
choices of π_0k, k = 1, 2, ..., φ_0. This model is deficient, since

P_\theta(failure | \tau, \phi, e) \equiv 1 - \sum_{\pi} P_\theta(\pi | \tau, \phi, e) > 0.    (122)

Generation

Equations (115)-(120) describe the following process for
producing f or failure from e:

1. For each i = 1, 2, ..., l, choose a length φ_i for τ_i
according to the distribution n(φ_i|e_i).

2. Choose a length φ_0 for τ_0 according to the distribution
n_0(φ_0 | Σ_{i=1}^{l} φ_i).

3. Let m = φ_0 + Σ_{i=1}^{l} φ_i.

4. For each i = 0, 1, ..., l and each k = 1, 2, ..., φ_i, choose
a French word τ_ik according to the distribution t(τ_ik|e_i).

5. For each i = 1, 2, ..., l and k = 1, 2, ..., φ_i, choose a
position π_ik from 1, ..., m according to the distribution
d(π_ik|i, l, m).



5,477,451 

71 72 

6. If any position has been chose more than once then 

return failure. -continued 

7. For each k=l, 2, . . . , <fo, choose a position from 

the <|> 0 -k+l remaining vacant positions in 1, 2, .... m Unfortunately, there is no analogous formula for e e (<()le; f, 
according to the uniform distribution. 5 c) mstead ^ following formulae hold: 

8. Let f be the string with f^^r*. 

Useful Formulae , m 033) 

ce«>i*c/»= I Steed n u- 
From Equations (118M120) it follows that if (t, 7t) are io w J=\ 

consistent with (f t a) then 

(123) 

(124) T^W* 

;Vfl ^° p^ c)== _^M^_ (135) 

In Equation (124), the product runs over all $j = 1, 2, \ldots, m$
except those for which $a_j = 0$.

In Equation (133), $\Gamma_\phi$ denotes the set of all partitions of $\phi$.

By summing over all pairs $(\tau, \pi)$ consistent with $(f, a)$ it
follows that



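$$P_\theta(f, a \mid e) = \sum_{(\tau, \pi) \in \langle f, a \rangle} P_\theta(\tau, \pi \mid e) \qquad (125)$$

$$= n_0\Big(\phi_0 \,\Big|\, \sum_{i=1}^{l} \phi_i\Big) \prod_{i=1}^{l} n(\phi_i \mid e_i)\, \phi_i! \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \prod_{j:\, a_j \neq 0} d(j \mid a_j, l, m). \qquad (126)$$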



The factors of $\phi_i!$ in Equation (126) arise because there are
$\prod_{i=0}^{l} \phi_i!$ equally probable terms in the sum (125).

Parameter Reestimation Formulae 



The parameter values $\theta$ that maximize the relative objective
function $R(P_{\tilde{\theta}}, P_\theta)$ can be found by applying the
considerations of Section 8.3.4. The counts $c(\omega; a, f, e)$ of
Equation (38) are



$$c_\theta(j \mid i, l, m; a, f, e) = \delta(i, a_j),$$



(127) 



(128) 
(129) 



Recall that a partition of $\phi$ is a decomposition of $\phi$ as a
sum of positive integers. For example, $\phi = 5$ has 7 partitions
since 1+1+1+1+1 = 1+1+1+2 = 1+1+3 = 1+2+2 = 1+4 = 2+3 = 5.
For a partition $\gamma$, let $\gamma_k$ be the number of times that k appears
in the sum, so that $\phi = \sum_{k=1}^{\phi} k \gamma_k$. If $\gamma$ is the partition
corresponding to 1+1+3, then $\gamma_1 = 2$, $\gamma_3 = 1$, and $\gamma_k = 0$ for k other
than 1 or 3. Here the convention that $\Gamma_0$ consists of the single
element $\gamma$ with $\gamma_k = 0$ for all k has been adopted.
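The partition counts quoted in this section (7 partitions of 5, 42 partitions of 10) and the asymptotic growth rate can be checked with a short Python sketch. The function names `count_partitions` and `asymptotic_estimate` are illustrative choices made here, not names from the source.

```python
import math


def count_partitions(phi):
    """Number of ways to write phi as a sum of positive integers (order ignored)."""
    ways = [1] + [0] * phi          # ways[0] = 1 matches the convention for Gamma_0
    for part in range(1, phi + 1):  # allow parts of size 1, then 2, ...
        for total in range(part, phi + 1):
            ways[total] += ways[total - part]
    return ways[phi]


def asymptotic_estimate(phi):
    """Growth estimate (4 * sqrt(3) * phi)^-1 * exp(pi * sqrt(2 * phi / 3))."""
    return math.exp(math.pi * math.sqrt(2 * phi / 3)) / (4 * math.sqrt(3) * phi)


print(count_partitions(5), count_partitions(10))
print(round(asymptotic_estimate(10), 1))
```

The exact count stays small for the fertilities that occur in practice, which is why a sum over partitions is affordable.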

Equation (133) allows us to compute the counts $c_\theta(\phi \mid e; f, e)$
in $O(lm + \phi g)$ operations, where g is the number of
partitions of $\phi$. Although g grows with $\phi$ like
$(4\sqrt{3}\,\phi)^{-1} \exp\big(\pi \sqrt{2\phi/3}\big)$, it is manageably small for small $\phi$.
For example, $\phi = 10$ has 42 partitions.



The parameter reestimation formulae for $t(f \mid e)$, $d(j \mid i, l, m)$,
and $n(\phi \mid e)$ are obtained by using these counts in Equations
(42)-(46).

Equation (44) requires a sum over alignments. If $P_\theta$
satisfies

$$P_\theta(a \mid e, f) = \prod_{j=1}^{m} p_\theta(a_j \mid j, e, f), \qquad (130)$$

as is the case for Models 1 and 2 (see Equation (107)), then this
sum can be calculated explicitly for $c_\theta(f \mid e; f, e)$ and $c_\theta(j \mid i; f, e)$:

(131)



Proof of Formula (133) 
Introduce the generating functional 



9=u 



where x is an indeterminate. Then



$=Oa\=0 



(136) 



(137) 



a*TF0y=i r=l 



65 



t=l ai=0 OofO j=i 



(138) 



siscMd z. 

i=l a i=0 



73 



-continued 

... Ix pbW*^ 

O^Ojbl 



m / 
*=1 /=1 a=0 



= Z8CMd * a 

1=1 i=l 






(139) 



(140) 



10.5 Model 4 



74 

Parameters

$t(f \mid e)$ translation probabilities
$n(\phi \mid e)$ fertility probabilities
$p_0, p_1$ fertility probabilities for $e_0$



10 



(141) 



$d_1(\Delta j \mid \mathcal{A}, \mathcal{B})$ distortion probabilities for movement
of the first word of a tablet
$d_{>1}(\Delta j \mid \mathcal{B})$ distortion probabilities for movement
of other words of a tablet



= Zflfet,) K 
•=1 /=1 



(l-pe(<W») * (l+p^W. 



(142) 



15 



To obtain Equation (138) rearrange the order of summation
and sum over $\phi$ to eliminate the $\delta$-function of $\phi$. To
obtain Equation (139), note that $\phi_i = \sum_{j=1}^{m} \delta(i, a_j)$ and so
$x^{\phi_i} = \prod_{j=1}^{m} x^{\delta(i, a_j)}$. To obtain Equation (140), interchange the
order of the sums on $a_j$ and the product on j. To obtain
Equation (141), note that in the sum on a, the only term for
which the power of x is nonzero is the one for which a = i.

Now note that for any indeterminates $x, y_1, y_2, \ldots, y_m$,

Here $\Delta j$ is an integer; $\mathcal{A}$ is an English class; and $\mathcal{B}$ is a
French class.



25 



General Formulae 
Assumptions 



048) 
(149) 



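$$\prod_{j=1}^{m} (1 + y_j x) = \sum_{\phi=0}^{m} x^{\phi} \sum_{\gamma \in \Gamma_\phi} \prod_{k=1}^{\phi} \frac{z_k^{\gamma_k}}{\gamma_k!}, \qquad (143)$$

$$\text{where } z_k = \frac{(-1)^{k+1}}{k} \sum_{j=1}^{m} (y_j)^k. \qquad (144)$$

This follows from the calculation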



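$$\prod_{j=1}^{m} (1 + y_j x) = \exp \sum_{j=1}^{m} \log(1 + y_j x) = \exp \sum_{j=1}^{m} \sum_{k=1}^{\infty} \frac{(-1)^{k+1}}{k} (y_j x)^k \qquad (145)$$

$$= \exp \sum_{k=1}^{\infty} z_k x^k = \sum_{n=0}^{\infty} \frac{1}{n!} \Big( \sum_{k=1}^{\infty} z_k x^k \Big)^{n} \qquad (146)$$

$$= \sum_{n=0}^{\infty} \frac{1}{n!} \sum_{\gamma:\, \sum_k \gamma_k = n} \binom{n}{\gamma_1\, \gamma_2\, \cdots} \prod_{k} (z_k x^k)^{\gamma_k} = \sum_{\phi=0}^{\infty} x^{\phi} \sum_{\gamma \in \Gamma_\phi} \prod_{k=1}^{\phi} \frac{z_k^{\gamma_k}}{\gamma_k!}. \qquad (147)$$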
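The identity of Equation (143) can be checked numerically on a small case. The sketch below expands $\prod_j (1 + y_j x)$ directly and compares each coefficient with the sum over partitions built from $z_k = \frac{(-1)^{k+1}}{k} \sum_j y_j^k$. The helper names `partitions`, `product_coeffs` and `partition_sum` are hypothetical, and exact rational arithmetic is used so that the comparison involves no rounding.

```python
from fractions import Fraction
from math import factorial


def partitions(phi, max_part=None):
    """Yield each partition of phi as a dict {k: gamma_k}."""
    if max_part is None:
        max_part = phi
    if phi == 0:
        yield {}
        return
    for k in range(min(phi, max_part), 0, -1):
        for rest in partitions(phi - k, k):
            gamma = dict(rest)
            gamma[k] = gamma.get(k, 0) + 1
            yield gamma


def product_coeffs(ys):
    """Coefficients of x^0 .. x^m in prod_j (1 + y_j * x)."""
    coeffs = [Fraction(1)]
    for y in ys:
        nxt = coeffs + [Fraction(0)]
        for p in range(len(coeffs)):
            nxt[p + 1] += coeffs[p] * y
        coeffs = nxt
    return coeffs


def partition_sum(ys, phi):
    """Sum over partitions gamma of phi of prod_k z_k^gamma_k / gamma_k!."""
    def z(k):
        return Fraction((-1) ** (k + 1), k) * sum(Fraction(y) ** k for y in ys)

    total = Fraction(0)
    for gamma in partitions(phi):
        term = Fraction(1)
        for k, g in gamma.items():
            term *= z(k) ** g / factorial(g)
        total += term
    return total


ys = [1, 2, 3]
for phi, coeff in enumerate(product_coeffs(ys)):
    assert partition_sum(ys, phi) == coeff
```

For `ys = [1, 2, 3]` the product is $1 + 6x + 11x^2 + 6x^3$, and the partition sum for $\phi = 4$ evaluates to zero, which illustrates that the $z_k$ are not algebraically independent.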



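$$P_\theta(\phi \mid e) = n_0\Big(\phi_0 \,\Big|\, \sum_{i=1}^{l} \phi_i\Big) \prod_{i=1}^{l} n(\phi_i \mid e_i) \qquad (150)$$

$$P_\theta(\tau \mid \phi, e) = \prod_{i=0}^{l} \prod_{k=1}^{\phi_i} t(\tau_{ik} \mid e_i) \qquad (151)$$

$$P_\theta(\pi \mid \tau, \phi, e) = \frac{1}{\phi_0!} \prod_{i=1}^{l} \prod_{k=1}^{\phi_i} p_{ik}(\pi_{ik}) \qquad (152)$$

where

$$n_0(\phi_0 \mid m') = \binom{m'}{\phi_0} p_0^{\,m' - \phi_0}\, p_1^{\,\phi_0} \qquad (153)$$

$$p_{ik}(j) = \begin{cases} d_1\big(j - c_{\rho_i} \mid \mathcal{A}(e_{\rho_i}), \mathcal{B}(\tau_{i1})\big) & \text{if } k = 1 \\ d_{>1}\big(j - \pi_{i,k-1} \mid \mathcal{B}(\tau_{ik})\big) & \text{if } k > 1 \end{cases} \qquad (154)$$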



The reader will notice that the left-hand side of Equation
(145) involves only powers of x up to m, while the subsequent
Equations (146)-(147) involve all powers of x. This is
because the $z_k$ are not algebraically independent. In fact, for
$\phi > m$, the coefficient of $x^\phi$ on the right-hand side of Equation
(147) must be zero. It follows that $z_\phi$ can be expressed as a
polynomial in $z_k$, $k = 1, 2, \ldots, m$.

Equation (143) can be used to identify the coefficient of
$x^\phi$ in Equation (142). Equation (133) follows by combining
Equations (142), (143), and the Definitions (134)-(136) and
(144).

In Equation (154), $\rho_i$ is the first position to the left of i for
which $\phi_i > 0$, and $c_\rho$ is the ceiling of the average position of
the words of $\tau_\rho$:

$$\rho_i = \max_{i' < i} \{ i' : \phi_{i'} > 0 \}, \qquad c_\rho = \Big\lceil \phi_\rho^{-1} \sum_{k=1}^{\phi_\rho} \pi_{\rho k} \Big\rceil. \qquad (155)$$

This model is deficient, since

$$P_\theta(\text{failure} \mid \tau, \phi, e) \equiv 1 - \sum_{\pi} P_\theta(\pi \mid \tau, \phi, e) > 0. \qquad (156)$$



Note that Equations (150), (151), and (153) are identical
to the corresponding formulae (118), (119), and (121) for
Model 3.

Generation

Equations (148)-(152) describe the following process for
producing f or failure from e:

1-4. Choose a tableau $\tau$ by following Steps 1-4 for Model 3.

5. For each $i = 1, 2, \ldots, l$ and each $k = 1, 2, \ldots, \phi_i$ choose
a position $\pi_{ik}$ as follows.

If k = 1 then choose $\pi_{i1}$ according to the distribution
$d_1(\pi_{i1} - c_{\rho_i} \mid \mathcal{A}(e_{\rho_i}), \mathcal{B}(\tau_{i1}))$.

If k > 1 then choose $\pi_{ik}$ greater than $\pi_{i,k-1}$ according to the
distribution $d_{>1}(\pi_{ik} - \pi_{i,k-1} \mid \mathcal{B}(\tau_{ik}))$.

6-8. Finish generating f by following Steps 6-8 for
Model 3.

10.6 Model 5 



Parameters

$t(f \mid e)$ translation probabilities
$n(\phi \mid e)$ fertility probabilities
$p_0, p_1$ fertility probabilities for $e_0$
$d_L(\Delta v \mid \mathcal{B}, v)$ distortion probabilities for leftward movement
of the first word of a tablet
$d_R(\Delta v \mid \mathcal{B}, v)$ distortion probabilities for rightward movement
of the first word of a tablet
$d_{LR}(l\_or\_r \mid \mathcal{B}, v, v')$ distortion probabilities for choosing left
or right movement
$d_{>1}(\Delta v \mid \mathcal{B}, v)$ distortion probabilities for movement of
other words of a tablet






(164) 



<iijKright!^(tn).va(cp/),v,i(m)) 

dj&iti) - v n (cp,)l^(ta),v ( i(m) - vn(cpi)) 

if 4= 1 and; § c# 

JLfiOcftl ^{T/i) ? v n (cpi),v/i(m)) 

<fc(vn(cpi) - vnOOl & faiXvn(cp/)) 

ifA=land;<Cpi 

4>i(va(/) - vo<7ia_i>l B (tii).Vft(m) - v^ittt-i)) 
ifk>\ 



In Equation (164), $\rho_i$ is the first position to the left of i
which has a non-zero fertility; and $c_\rho$ is the ceiling of the
average position of the words of tablet $\rho$ (see Equation
(155)). Also, $\epsilon_{ik}(j)$ is 1 if position j is vacant after all the
words of tablets $i' < i$ and the first k-1 words of tablet i have
been placed, and 0 otherwise. $v_{ik}(j)$ is the number of
vacancies not to the right of j at this time: $v_{ik}(j) = \sum_{j' \le j} \epsilon_{ik}(j')$.

This model is not deficient. Note that Equations (159),
(160), and (163) are identical to the corresponding formulae
for Model 3.

N.B. In the previous section, a simplified embodiment of
this model in which the probabilities $d_{LR}$ do not appear is
described.

Here $v, v' = 1, 2, \ldots, m$; and l_or_r = left or right.



General Formula 



Assumptions 



V i=l / |=i 



/ to 

/»eW^)= n K tinned 



where 



M 4=1 



35 



(157) 40 
(159) 

45 



(159) 



50 



(160) 



55 



(161) 



60 



(162) 



(163) 



65 



Generation

Equations (157)-(161) describe the following process for
producing f from e:

1-4. Choose a tableau $\tau$ by following Steps 1-4 for
Model 3.

5. For each $i = 1, 2, \ldots, l$ and each $k = 1, 2, \ldots, \phi_i$ choose
a position $\pi_{ik}$ as follows:

If k = 1 then

(a) Choose left or right according to the distribution
$d_{LR}(l\_or\_r \mid \mathcal{B}(\tau_{i1}), v_{i1}(c_{\rho_i}), v_{i1}(m))$.

(b) If right, then choose a vacant position $\pi_{i1}$ greater
than or equal to $c_{\rho_i}$ according to the distribution
$d_R(v_{i1}(\pi_{i1}) - v_{i1}(c_{\rho_i}) \mid \mathcal{B}(\tau_{i1}), v_{i1}(m) - v_{i1}(c_{\rho_i}))$.

(c) Otherwise, choose a vacant position $\pi_{i1}$ less than or
equal to $c_{\rho_i}$ according to the distribution
$d_L(v_{i1}(c_{\rho_i}) - v_{i1}(\pi_{i1}) \mid \mathcal{B}(\tau_{i1}), v_{i1}(c_{\rho_i}))$.

If k > 1 then choose a vacant position $\pi_{ik}$ greater than
$\pi_{i,k-1}$ according to the distribution
$d_{>1}(v_{ik}(\pi_{ik}) - v_{ik}(\pi_{i,k-1}) \mid \mathcal{B}(\tau_{ik}), v_{ik}(m) - v_{ik}(\pi_{i,k-1}))$.

6-8. Finish generating f by following Steps 6-8 for
Model 3.

11 SENSE DISAMBIGUATION
11.1 Introduction

An alluring aspect of the statistical approach to machine
translation is the systematic framework it provides for
attacking the problem of lexical disambiguation. For
example, an embodiment of the machine translation system
depicted in FIG. 4 translates the French sentence Je vais
prendre la decision as I will make the decision, correctly
interpreting prendre as make. Its statistical translation
model, which supplies English translations of French words,
prefers the more common translation take, but its trigram
language model recognizes that the three-word sequence
make the decision is much more probable than take the
decision.

This system is not always so successful. It incorrectly
renders Je vais prendre ma propre decision as I will take my
own decision. Its language model does not realize that take
my own decision is improbable because take and decision no
longer fall within a single trigram.

Errors such as this are common because the statistical
models of this system only capture local phenomena; if the
context necessary to determine a translation falls outside the
scope of these models, a word is likely to be translated
incorrectly. However, if the relevant context is encoded
locally, a word can be translated correctly.

As has been noted in Section 3, such encoding can be
performed in the source-transduction phase 701 by a
sense-labeling transducer.

In this section, the operation and construction of such a
sense-labeling transducer is described. The goal of this
transducer is to perform cross-lingual word-sense labeling.
That is, this transducer labels words of a sentence in a source
language so as to elucidate their translations into a target
language. Such a transducer can also be used to label the
words of a target sentence so as to elucidate their
translations into a source language.

The design of this transducer is motivated by some
examples. In some contexts the French verb prendre
translates into English as to take, but in other contexts it translates
as to make. A sense disambiguation transformation, by
examining the contexts, might label occurrences of prendre
that likely mean to take with one label, and other
occurrences of prendre with another label. Then the uncertainty in
the translation of prendre given the label would be less than
the uncertainty in the translation of prendre without the
label. Although the label does not provide any information
that is not already present in the context, it encodes this
information locally. Thus a local statistical model for the
transfer of labeled sentences should be more accurate than
one for the transfer of unlabeled ones.

While the translation of a word depends on many words
in its context, it is often possible to obtain information by
looking at only a single word. For example, in the sentence
Je vais prendre ma propre decision (I will make my own
decision), the verb prendre should be translated as make
because its object is decision. If decision is replaced by voiture then
prendre should be translated as take: Je vais prendre ma
propre voiture (I will take my own car). Thus the uncertainty
in the translation of prendre is reduced by asking a question
about its object, which is often the first noun to its right. A
sense can be assigned to prendre based upon the answer to
this question.

As another example, in Il doute que les nôtres gagnent
(He doubts that we will win), the word il should be translated
as he. On the other hand, if doute is replaced by faut then il
should be translated as it: Il faut que les nôtres gagnent (It
is necessary that we win). Here, a sense label can be
assigned to il by asking about the identity of the first verb to
its right.

These examples motivate a sense-labeling scheme in
which the label of a word is determined by a question about
an informant word in its context. In the first example, the
informant of prendre is the first noun to the right; in the
second example, the informant of il is the first verb to the
right. The first example is depicted in FIG. 40. The two
word sequences of this example are labeled 4001 and 4002 in the
figure.

If more than two senses are desired for a word, then
questions with more than two answers can be considered.



11.2 Design of a Sense-Labeling Transducer

FIG. 41 depicts an embodiment of a sense-labeling
transducer based on this scheme. For expositional purposes, this
embodiment will be discussed in the context of labeling
words in a source language word sequence. It should be
understood that in other embodiments, a sense-labeling
transducer can accept as input more involved
source-structure representations, including but not limited to lexical
morpheme sequences. In these embodiments, the
constituents of a representation of a source word sequence are
annotated by the sense-labeling transducer with labels that
elucidate their translation into analogous constituents in a
similar representation of a target word sequence. It should
also be understood that in still other embodiments, a
sense-labeling transducer annotates target-structure
representations (not source-structure representations) with sense
labels.

The operation of the embodiment of the sense-labeling 
transducer depicted in FIG. 41 comprises the steps of: 

4101. Capturing an input sequence consisting of a 
sequence of words in a source language; 

4102. For each word in the input sequence performing the 
Steps 4107, 4108, 4109, 4104 until no more words are 
available in Step 4105; 

4107. For the input word being considered, finding a best 
informant site such as the noun to the right, the verb to the 
left, etc. A best informant for a word is obtained using a table 
4103 stored in memory of informants and questions about 
the informants for every word in the source language 
vocabulary; 

4108. Finding the word at the best informant site in the input
word sequence;

4109. Obtaining the class of the informant word as given 
by the answer to a question about the informant word; 

4104. Labeling the original input word of the input 
sequence with the class of the informant word. 

For the example depicted in FIG. 40, the informant site
determined by Step 4107 is the noun to the right. For the first
word sequence 4001 of this example, the informant word
determined by Step 4108 is decision; for the second word
sequence 4002, the informant word is voiture. In this
example, the class of decision determined in Step 4109 is
different from the class of voiture. Thus the label attached to
prendre by Step 4104 is different for these two contexts of
prendre.
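The labeling loop of Steps 4102 through 4104 can be sketched as follows. This is a minimal illustration under several assumptions that are not in the source: part-of-speech tags are supplied alongside the words, an informant site is encoded as a hypothetical `(tag, direction)` pair, a question is a dictionary from informant words to classes, and words without a table entry are left unlabeled.

```python
def find_informant(words, tags, i, site):
    """Return the nearest word with the given tag to the left (-1) or right (+1) of i."""
    tag, direction = site
    j = i + direction
    while 0 <= j < len(words):
        if tags[j] == tag:
            return words[j]
        j += direction
    return None  # no informant found at this site


def label_senses(words, tags, table):
    """Label each word with the class of its informant word.

    table[word] = (site, question): the informant site for the word and a
    question mapping informant words to classes (the role of table 4103).
    """
    labeled = []
    for i, word in enumerate(words):
        if word not in table:
            labeled.append((word, None))  # assumption: unlisted words stay unlabeled
            continue
        site, question = table[word]
        informant = find_informant(words, tags, i, site)
        labeled.append((word, question.get(informant)))
    return labeled


# Hypothetical table entry for the example of FIG. 40.
table = {"prendre": (("NOUN", +1), {"decision": "MAKE", "voiture": "TAKE"})}
tags = ["PRON", "VERB", "VERB", "DET", "ADJ", "NOUN"]
print(label_senses("je vais prendre ma propre decision".split(), tags, table)[2])
print(label_senses("je vais prendre ma propre voiture".split(), tags, table)[2])
```

The two calls label prendre differently because the first noun to its right falls into a different class in each sentence, which is the behavior described for word sequences 4001 and 4002.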

11.3 Constructing a Table of Informants and Questions
An important component of the sense-labeler depicted in
FIG. 41 is a table 4103 of informants and questions for
every word in a source language vocabulary.

FIG. 42 depicts a method of constructing such a table.
This method comprises the steps of:

4203. Performing the Steps 4201 and 4202 for each word
in a source language vocabulary.

4201. For the word being considered, finding a good 
question for each of a plurality of informant sites. These 
informant sites are obtained from a table 4207 stored in 
memory of possible informant sites. Possible sites include 
but are not limited to, the nouns to the right and left, the 
verbs to the right and left, the words to the right and left, the 
words two positions to the right or left, etc. A method of 
finding a good question is described below. This method
makes use of a table 4205, stored in memory, of probabilities
derived from Viterbi alignments. These probabilities are also
discussed below.

4202. Storing in a table 4206 of informants and questions, 
the informant site and the good question. 

A method of carrying out the Step 4201 for finding a good
question for each informant site of a vocabulary word is
depicted in FIG. 43. This method comprises the steps of:



4301. Performing the Steps 4302 and 4207 for each 
possible informant site. Possible informant sites are obtained 
from a table 4304 of such sites. 

4302. For the informant site being considered, finding a 
question about informant words at this site that provides a lot 
of information about translations of the vocabulary word 
into the target language. 

4207. Storing the vocabulary word, the informant site, 
and the good question in a table 4103. 

11.4 Mathematics of Constructing Questions
A method for carrying out the Step 4302 of finding a
question about an informant is depicted in FIG. 44. In this
subsection, some preliminaries for describing this
method are given. The notation used here is the same as that
used in Sections 8-10.

11.4.1 Statistical Translation with Transductions
Recall the setup depicted in FIG. 2. A system shown there
employs

1. A source transducer 201 which encodes a source
sentence f into an intermediate structure f'.

2. A statistical translation component 202 which translates
f' into a corresponding intermediate target structure e'. This
component incorporates a language model, a translation
model, and a decoder.

3. A target transducer 203 which reconstructs a target
sentence e from e'.

For statistical modeling, the target-transduction
transformation 203 e' → e is sometimes required to be invertible. Then
e' can be constructed from e and no information is lost in the
transformation.

The purpose of source and target transduction is to
facilitate the task of the statistical translation. This will be
accomplished if the probability distribution Pr(f', e') is easier
to model than the original distribution Pr(f, e). In practice
this means that e' and f' should encode global linguistic facts
about e and f in a local form.

A useful gauge of the success that a statistical model
enjoys in modeling translation from sequences of source
words represented by a random variable F, to sequences of
target words represented by a random variable E, is the cross
entropy (footnote 3)

$$H(E \mid F) = -\frac{1}{S} \sum_{(f, e)} \log P(e \mid f). \qquad (165)$$

3 In this equation and in the remainder of this section, the convention of using
uppercase letters (e.g. E) for random variables and lower case letters (e.g. e)
for the values of random variables continues to be used.



The cross entropy measures the average uncertainty that 
the model has about the target language translation e of a 
source language sequence f. Here P(elf) is the probability 
according to the model that e is a translation of f and the sum 
runs over a collection of all S pairs of sentences in a large 
corpus comprised of pairs of sentences with each pair 
consisting of a source and target sentence which are trans- 
lations of one another (See Sections 8-10). 
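As a small worked illustration of this cross entropy, the sketch below averages the negative log probability that a model assigns to each aligned sentence pair. The function name `cross_entropy`, the callable `model_prob(e, f)` and the toy probabilities are assumptions introduced here; the base of the logarithm is not stated in this section, so base 2 (bits) is used as an assumption.

```python
import math


def cross_entropy(pairs, model_prob):
    """Average of -log2 P(e|f) over a list of (f, e) sentence pairs."""
    total = sum(math.log2(model_prob(e, f)) for f, e in pairs)
    return -total / len(pairs)


# Toy model: a lookup table of P(e|f) for two hypothetical sentence pairs.
probs = {("e1", "f1"): 0.5, ("e2", "f2"): 0.25}
pairs = [("f1", "e1"), ("f2", "e2")]
print(cross_entropy(pairs, lambda e, f: probs[(e, f)]))  # 1.5 bits per pair
```

A model that assigned higher probability to the observed translations would yield a smaller value, which is the sense in which a better model has lower cross entropy.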

A better model has less uncertainty and thus a lower cross
entropy. The utility of source and target transductions can be
measured in terms of this cross entropy. Thus
transformations f → f' and e' → e are useful if models P'(f'|e') and P'(e')
can be constructed such that H(E'|F') < H(E|F).

11.4.2 Sense-Labeling in Statistical Translation 

The remainder of this section is devoted to describing
methods for constructing a sense-labeling transducer. In this
case the following setup prevails:

The Intermediate Structures. Intermediate structures e'
and f' consist of sequences of words labeled by their senses.
Thus f' is a sentence over the expanded vocabulary whose
'words' f' are pairs (f, s) where f is a word in the original
source vocabulary and s is its sense label. Similarly, e' is a
sentence over the expanded vocabulary whose words e' are
pairs (e, s) where e is a target word and s is its sense label.

Source and target transductions. For each source word and
each target word, an informant site, such as first noun to the
left, is chosen, and an n-ary question about the value of the
informant at that site is also chosen. A source-transduction
transformation f → f' and an inverse target-transduction
transformation e → e' map a sentence to the intermediate
structure in which each word is labeled by a sense,
determined by the question about its informant. A
target-transduction transformation e' → e maps a labeled sentence to a
sentence in which the labels have been removed.

The probability models. A translation model such as one
of the models in Sections 8-10 is used for both P'(F'|E') and
for P(F|E). A trigram language model such as that discussed
in Section 6 is used for both P(E) and P'(E').

11.4.3 The Viterbi Approximation 

The probability P(f|e) computed by the translation model
requires a sum over alignments as discussed in detail in
Sections 8-10. This sum is often too expensive to compute
directly since the number of alignments increases
exponentially with sentence length. In the mathematical
considerations of this Section, this sum will be approximated by the
single term corresponding to the alignment, V(f|e), with
greatest probability. This is the Viterbi approximation
already discussed in Sections 8-10 and V(f|e) is the Viterbi
alignment.

Let c(f|e) be the expected number of times that e is aligned
with f in the Viterbi alignment of a pair of sentences drawn
at random from a large corpus of training data. Let $c(\phi \mid e)$ be
the expected number of times that e is aligned with $\phi$ words.
Then



(166) 



where c(f|e; A) is the number of times that e is aligned with
f in the alignment A, and $c(\phi \mid e; A)$ is the number of times that
e generates $\phi$ target words in A. The counts above are also
expressible as averages with respect to the model:

(167)



Probability distributions p(e, f) and $p(\phi, e)$ are obtained by
normalizing the counts c(f|e) and $c(\phi \mid e)$ (footnote 4):

$$p(f, e) = \frac{1}{\text{norm}}\, c(f \mid e), \qquad p(\phi, e) = \frac{1}{\text{norm}}\, c(\phi \mid e). \qquad (168)$$

4 In these equations and in the remainder of the paper, the generic symbol
1/norm is used to denote a normalizing factor that converts counts
to probabilities. The actual value of 1/norm
will be implicit from the context. Thus, for example, in the
left hand equation of (168), the normalizing factor is
norm = $\sum_{f, e} c(f \mid e)$, which equals the average length of source
sentences. In the right hand equation of (168), the normalizing
factor is the average length of target sentences.

(These are the probabilities that are stored in a table of
probabilities 4205.) The conditional distributions p(f|e) and
$p(\phi \mid e)$ are the Viterbi approximation estimates for the
parameters of the model. The marginals satisfy



1 z, 



tf(flE)=*n{f/(£IF>f //($!£)}, 



where m is the average length of the source sentences in the
training data, and H(F|E) and $H(\Phi \mid E)$ are the conditional
entropies for the probability distributions p(f, e) and $p(\phi, e)$:



//(*»!£) = 



-: Z pV.e)\ogp<fie) 
I* 



A similar expression for the cross entropy H(E|F) will
now be given. Since

P(f, e) = P(f|e) P(e),

this cross entropy depends on both the translation model,
P(f|e), and the language model, P(e). With a suitable
additional approximation,



//(EIF)=ro{ //(<►!£)-£(£, 



10 



(169) 



15 



where u(e) and u(f) are the unigram distributions of e and
f and $\bar{\phi}(e) = \sum_{\phi} p(\phi \mid e)\, \phi$ is the average number of source words
aligned with e. These formulae reflect the fact that in any
alignment each source word is aligned with exactly one
target word.
11.4.4 Cross Entropy

In this subsection the cross entropies H(E|F) and H(E'|F')
are expressed in terms of the information between source
and target words.

In the Viterbi approximation, the cross entropy H(F|E) is
given by,



(170) 



35 



(171) 



40 



45 



(172) 



where H(E) is the cross entropy of P(E) and I(F, E) is the
mutual information between f and e for the probability
distribution p(f, e).
The additional approximation required is,

$$H(F) \approx m H(F) = -m \sum_{f} p(f) \log p(f), \qquad (173)$$



where p(f) is the marginal of p(f,e). This amounts to approxi- 
mating P(f) by the unigram distribution that is closest to it 
in cross entropy. Granting this, formula (172) is a conse- 
quence of (170) and of the identities 



(174)

Next consider H(E'|F'). Let e → e' and f → f' be sense
labeling transformations of the type discussed above.
Assume that these transformations preserve Viterbi
alignments; that is, if the words e and f are aligned in the Viterbi
alignment for (f, e), then their sensed versions e' and f' are
aligned in the Viterbi alignment for (f', e'). It follows that the
word translation probabilities obtained from the Viterbi
alignments satisfy $p(f, e) = \sum_{f' \in f} p(f', e) = \sum_{e' \in e} p(f, e')$, where the
sums range over the sensed versions f' of f and the sensed
versions e' of e.

By applying (172) to the cross entropies H(E|F), H(E|F'),
and H(E'|F), it is not hard to verify that

$$H(E \mid F') = H(E \mid F) - m \sum_{f} p(f)\, I(E, F' \mid f), \qquad (175)$$

$$H(E' \mid F) = H(E \mid F) - m \sum_{e} p(e) \{ I(F, E' \mid e) + I(\Phi, E' \mid e) \}.$$

Here I(E, F'|f) is the conditional mutual information given
a source word f between its translations E and its sensed
versions F'; I(F, E'|e) is the conditional mutual information
given a target word e between its translations F and its
sensed versions E'; and $I(\Phi, E' \mid e)$ is the conditional mutual
information given e between $\Phi$ and its sensed versions E'.

11.5 Selecting Questions 

The method depicted in FIG. 44 for finding good informants
and questions for sensing is now described.
11.5.1 Source Questions

For sensing source sentences, a question about an
informant is a function $\hat{c}$ from the source vocabulary into the set
of possible senses. If the informant of f is x, then f is
assigned the sense $\hat{c}(x)$. The function $\hat{c}(x)$ is chosen to
minimize the cross entropy H(E|F'). From formula (175),
this is equivalent to maximizing the conditional mutual
information I(F', E|f) between E and F'



(176) 



where p(f, e, x) is the probability distribution obtained by
counting the number of times in the Viterbi alignments that
e is aligned with f and the value of the informant of f is x,



50 



Ic^Jd^)) 

5 



(177) 



P<f,c.e) = - 



1 



DOfXD X ctx)=c 



An exhaustive search for the best $\hat{c}$ requires a computation
that is exponential in the number of values of x and is not
practical. In the aforementioned paper entitled "Word-Sense
Disambiguation using Statistical Methods" by P. F.
Brown, et al., a good $\hat{c}$ is found using the flip-flop method,
which is only applicable if the number of senses is restricted
to two. Here a different method that can be used to find $\hat{c}$ for
any number of senses is described. This method uses the
technique of alternating minimization, and is similar to the
k-means method for determining pattern clusters and to the
generalized Lloyd method for designing vector quantizers.

The method is based on the fact that, up to a constant
independent of $\hat{c}$, the mutual information I(F', E|f) can be
expressed as an infimum over conditional probability
distributions q(E|c),



q x 

where 



$$D(p(E); q(E)) \equiv \sum_{e} p(e) \log \frac{p(e)}{q(e)},$$



(178) 



10 



(179) 



The best value of the information is thus an infimum over
both the choice for $\hat{c}$ and the choice for the q. This suggests
the iterative method, depicted in FIG. 44, for obtaining a good
$\hat{c}$. This method comprises the steps of:

4401. Beginning with an initial choice of $\hat{c}$;

4404. Performing Steps 4402 and 4403 until no further
increase in I(F', E|f) results;

4403. For given q, finding the best $\hat{c}$:

$$\hat{c}(x) = \arg\min_{c} D\big(p(E \mid x, f);\, q(E \mid c)\big);$$

4402. For this $\hat{c}$, finding the best q:

$$q(e \mid c) = \frac{1}{\text{norm}} \sum_{x:\, \hat{c}(x) = c} p(f, e, x).$$
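The alternation between Steps 4402 and 4403 can be sketched in Python for a single source word f. This is a hedged illustration rather than the patent's procedure: the data layout `joint[x][e]` (standing for p(f, e, x) with f fixed), the function names, and the stopping rule (stop when the assignment of informants to senses no longer changes, instead of monitoring the mutual information) are assumptions made here.

```python
import math


def kl(p, q):
    """D(p; q) = sum_e p(e) log(p(e) / q(e)); infinite if q misses support of p."""
    total = 0.0
    for e, pe in p.items():
        if pe == 0.0:
            continue
        qe = q.get(e, 0.0)
        if qe == 0.0:
            return math.inf
        total += pe * math.log(pe / qe)
    return total


def best_q(joint, sense_of, n_senses):
    """Step 4402: q(e|c) is the normalized sum of p(e, x) over x assigned to c."""
    q = [dict() for _ in range(n_senses)]
    for x, row in joint.items():
        c = sense_of[x]
        for e, pex in row.items():
            q[c][e] = q[c].get(e, 0.0) + pex
    for dist in q:
        norm = sum(dist.values())
        for e in dist:
            dist[e] /= norm
    return q


def best_senses(joint, q):
    """Step 4403: assign each x to the sense c minimizing D(p(E|x); q(E|c))."""
    sense_of = {}
    for x, row in joint.items():
        px = sum(row.values())
        cond = {e: v / px for e, v in row.items()}
        sense_of[x] = min(range(len(q)), key=lambda c: kl(cond, q[c]))
    return sense_of


def alternate(joint, sense_of, n_senses, max_iters=100):
    """Steps 4401 and 4404: iterate from an initial assignment until it is stable."""
    for _ in range(max_iters):
        q = best_q(joint, sense_of, n_senses)
        new_sense_of = best_senses(joint, q)
        if new_sense_of == sense_of:
            break
        sense_of = new_sense_of
    return sense_of


# Hypothetical counts for prendre: informant noun x -> {translation e: p(e, x)}.
joint = {
    "decision": {"make": 0.30, "take": 0.02},
    "choix": {"make": 0.20, "take": 0.03},
    "voiture": {"make": 0.01, "take": 0.25},
    "train": {"make": 0.01, "take": 0.18},
}
start = {"decision": 0, "choix": 1, "voiture": 1, "train": 0}
print(alternate(joint, start, n_senses=2))
```

Each of the two steps can only lower the weighted divergence, so the procedure converges, but like k-means it may stop at a local optimum that depends on the initial choice.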



11.5.2 Target Questions 

For sensing target sentences, a question about an
informant is a function $\hat{c}$ from the target vocabulary into the
set of possible senses. $\hat{c}$ is chosen to minimize the entropy
H(E'|F). From (175) this is equivalent to maximizing the
sum $I(F, E' \mid e) + I(\Phi, E' \mid e)$.

In analogy to (179),

$$I(F, E' \mid e) + I(\Phi, E' \mid e) = \inf_{q_1, q_2} \sum_{x} p(x) \big\{ D\big(p(F \mid x, e);\, q_1(F \mid \hat{c}(x))\big) + D\big(p(\Phi \mid x, e);\, q_2(\Phi \mid \hat{c}(x))\big) \big\}. \qquad (180)$$

Again a good $\hat{c}$ is obtained by alternating minimization.
11.6 Generalizations 

The methods of sense-labeling discussed above ask a
single question about a single word of context. In other
embodiments of the sense labeler, this question is the first
question in a decision tree. In still other embodiments, rather
than using a single informant site to determine the sense of
a word, questions from several different informant sites are
combined to determine the sense of a word. In one
embodiment, this is done by assuming that the probability of an
informant word $x_i$ at informant site i, given a target word e,
is independent of an informant word $x_j$ at a different
informant site j given the target word e. Also, in other
embodiments, the intermediate source and target structure
representations are more sophisticated than word sequences,
including, but not limited to, sequences of lexical
morphemes, case frame sequences, and parse tree structures.






12 ALIGNING SENTENCES 

In this section, a method is described for aligning
sentences in parallel corpora, and extracting from parallel
corpora pairs of sentences which are translations of one
another. These tasks are not trivial because at times a single
sentence in one corpus is translated as two or more
sentences in the other corpus. At other times a sentence, or
even a whole passage, may be missing from one or the other
of the corpora.

A number of researchers have developed methods that
align sentences according to the words that they contain.
(See for example, "Deriving translation data from bilingual
text", by R. Catizone, G. Russel, and S. Warwick, appearing
in Proceedings of the First International Acquisition
Workshop, Detroit, Mich. 1989; and "Making Connections", by
M. Kay, appearing in ACH/ALLC '91, Tempe, Ariz. 1991.)
Unfortunately, these methods are necessarily slow and,
despite the potential for high accuracy, may be unsuitable for
very large collections of text.

In contrast, the method described here makes no use of 
lexical details of the corpora. Rather, the only information 
that it uses, besides optional information concerning anchor 
points, is the lengths of the sentences of the corpora. As a 
result, the method is very fast and therefore practical for 
application to very large collections of text. 

The method was used to align several million sentences
from parallel French and English corpora derived from the
proceedings of the Canadian Parliament. The accuracy of
these alignments was in excess of 99% for a randomly
selected set of 1000 alignments that were checked by hand.
The correlation between the lengths of aligned sentences 
indicates that the method would achieve an accuracy of 
between 96% and 97% even without the benefit of anchor 
points. This suggests that the method is applicable to a very 
wide variety of parallel corpora for which anchor points are 
not available. 

12.1 Overview 

One embodiment of the method is illustrated
schematically in FIG. 46. It comprises the steps of:

4601 and 4602. Tokenizing the text of each corpus.

4603 and 4604. Determining sentence boundaries in each
corpus.

4605. Determining alignments between the sentences of
the two corpora.

The basic step 4605 of determining sentence alignments
is elaborated further in FIG. 47. It comprises the steps of:

4701. Finding major and minor anchor points in each
corpus. This divides each corpus into sections between
major anchor points, and subsections between minor anchor
points.

4702. Determining alignments between major anchor
points.

4703. Retaining only those aligned sections for which the
number of subsections is the same in both corpora.

4704. Determining alignments between sentences within
each of the remaining aligned subsections.

One embodiment of the method will now be explained. 
The various steps will be illustrated using as an example the 
aforementioned parallel French and English corpora derived 
from the Canadian Parliamentary proceedings. These pro- 
ceedings, called the Hansards, are transcribed in both 
English and French. The corpora of the example consist of
the English and French Hansard transcripts for the years
1973 through 1986.

It is understood that the method and techniques illustrated 
in this embodiment and the Hansard example can easily be 
extended and adapted to other corpora and other languages. 




12.2 Tokenization and Sentence Detection 

First, the corpora are tokenized (steps 4601 and 4602)
using a finite-state tokenizer of the sort described in
Subsection 3.2.1. Next (steps 4603 and 4604), the corpora are
partitioned into sentences using a finite-state sentence
boundary detector. Such a sentence detector can easily be
constructed by one skilled in the art. Generally, the sentences
produced by such a sentence detector conform to the
grade-school notion of a sentence: they begin with a capital letter,
contain a verb, and end with some type of sentence-final
punctuation. Occasionally, they fall short of this ideal and
consist merely of fragments and other groupings of words.

In the Hansard example, the English corpus contains
85,016,286 tokens in 3,510,744 sentences, and the French
corpus contains 97,857,452 tokens in 3,690,425 sentences.
The average English sentence has 24.2 tokens, while the
average French sentence is about 9.5% longer with 26.5
tokens. The left-hand side of FIG. 48 shows the raw data for
a portion of the English corpus, and the right-hand side
shows the same portion after it was cleaned, tokenized, and
divided into sentences. The sentence numbers do not
advance regularly because the sample has been edited in
order to display a variety of phenomena.

12.3 Selecting Anchor Points 

The selection of suitable anchor points (Step 4701) is a
corpus-specific task. Some corpora may not contain any
reasonable anchors.

In the Hansard example, suitable anchors are supplied by 
various reference markers that appear in the transcripts. 
These include session numbers, names of speakers, time 
stamps, question numbers, and indications of the original 
language in which each speech was delivered. This auxiliary 
information is retained in the tokenized corpus in the form 
of comments sprinkled throughout the text. Each comment
has the form \SCM{} . . . \ECM{} as shown on the 
right-hand side of FIG. 48. 

To supplement the comments which appear explicitly in
the transcripts, a number of additional comments were
added. Paragraph comments were inserted as suggested by
the space command of the original markup language. An
example of this command appears in the eighth line of the
left-hand side of FIG. 48. The beginning of a parliamentary
session was marked by a Document comment, as illustrated
in Sentence 1 on the right-hand side of FIG. 48. Usually,
when a member addresses the parliament, his name is
recorded. This was encoded as an Author comment, an
example of which appears in Sentence 4. If the president
speaks, he is referred to in the English corpus as Mr. Speaker
and in the French corpus as M. le President. If several
members speak at once, a shockingly regular occurrence,
they are referred to as Some Hon. Members in the English
and as Des Voix in the French. Times are recorded either as
exact times on a 24-hour basis, as in Sentence 81, or as
inexact times, of which there are two forms: Time=Later and
Time=Recess. These were encoded in the French as
Time=Plus Tard and Time=Recess. Other types of comments
are shown in Table 18.

TABLE 18

Examples of comments

English                   French

Source = English          Source = Traduction
Source = Translation      Source = Français
Source = Text             Source = Texte
Source = List Item        Source = List Item
Source = Question         Source = Question
Source = Answer           Source = Réponse




The resulting comments laced throughout the text are 
used as anchor points for the alignment process. The com- 
ments Author=Mr. Speaker, Author=M. le President, 
Author=Some Hon. Members, and Author=Des Voix are 
deemed minor anchors. All other comments are deemed 
major anchors with the exception of the Paragraph comment 
which was not treated as an anchor at all. The minor anchors 
are much more common than any particular major anchor, 
making an alignment based on minor anchors much less 
robust against deletions than one based on the major 
anchors. 

12.4 Aligning Major Anchors 

Major anchors, if they are to be useful, will usually appear
in parallel in two corpora. Sometimes, however, through
inattention on the part of translators or other misadventure,
an anchor in one corpus may be garbled or omitted in
another. In the Hansard example, for instance, this
problem is not uncommon for anchors based upon names of
speakers.

The major anchors of two corpora are aligned (Step 4702)
by the following method. First, each connection of an
alignment is assigned a numerical cost that favors exact
matches and penalizes omissions or garbled matches. In the
Hansard example, these costs were chosen to be integers
between 0 and 10. Connections between corresponding pairs,
such as Time=Later and Time=Plus Tard, were assigned a
cost of 0, while connections between different pairs, such as
Time=Later and Author=Mr. Bateman, were assigned a cost
of 10. A deletion was assigned a cost of 5. A connection
between two names was assigned a cost proportional to the
minimal number of insertions, deletions, and substitutions
necessary to transform one name, letter by letter, into the
other.
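The cost assignment above can be sketched in Python as follows. This is only an illustration under stated assumptions: the EQUIVALENT table, the "Author=" prefix test, and the proportionality constant of 1 (capped at 10) are hypothetical choices, since the text gives only the cost range and the three example costs.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimal number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute, or match at no cost
        prev = cur
    return prev[-1]


# Hypothetical table of corresponding English/French comments.
EQUIVALENT = {
    ("Time=Later", "Time=Plus Tard"),
    ("Time=Recess", "Time=Recess"),
}

DELETION_COST = 5    # cost of leaving an anchor unmatched (from the text)
MISMATCH_COST = 10   # cost of connecting unrelated anchors (from the text)
AUTHOR = "Author="


def connection_cost(english: str, french: str) -> int:
    """Cost of connecting an English anchor with a French anchor."""
    if english == french or (english, french) in EQUIVALENT:
        return 0
    if english.startswith(AUTHOR) and french.startswith(AUTHOR):
        # Assumption: proportionality constant 1, capped at the mismatch cost.
        distance = edit_distance(english[len(AUTHOR):], french[len(AUTHOR):])
        return min(MISMATCH_COST, distance)
    return MISMATCH_COST
```

With these choices, a slightly garbled speaker name stays cheap to connect, while a time stamp connected to a speaker name receives the full cost of 10.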

Given these costs, the standard technique of dynamic
programming is used to find the alignment between the
major anchors with the least total cost. Dynamic programming
is described by R. Bellman in the book titled Dynamic
Programming, published by Princeton University Press,
Princeton, N.J. in 1957. In theory, the time and space
required to find this alignment grow as the product of the
lengths of the two sequences to be aligned. In practice,
however, by using thresholds and the partial traceback
technique described by Brown, Spohrer, Hochschild, and
Baker in their paper, Partial Traceback and Dynamic
Programming, published in the Proceedings of the IEEE
International Conference on Acoustics, Speech and Signal
Processing, in Paris, France in 1982, the time required can be
made linear in the length of the sequences, and the space can
be made constant. Even so, the computational demand is
severe. In the Hansard example, the two corpora were out of
alignment in places by as many as 90,000 sentences owing
to mislabelled or missing files.
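A minimal Python sketch of the least-cost alignment is given below. It is the plain quadratic dynamic program, in which time and space grow as the product of the sequence lengths; it does not implement the thresholds or the partial traceback technique mentioned above. The function name, the cost callback, and the move labels are illustrative assumptions.

```python
def align_anchors(xs, ys, cost, deletion_cost=5):
    """Least-cost alignment of two anchor sequences.

    cost(x, y) is the cost of connecting x with y; deletion_cost is
    charged for every anchor left unconnected. Returns the total cost
    and a list of (i, j) index pairs, with None marking a deletion.
    """
    n, m = len(xs), len(ys)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:
                c = best[i - 1][j - 1] + cost(xs[i - 1], ys[j - 1])
                if c < best[i][j]:
                    best[i][j], back[i][j] = c, "connect"
            if i > 0:
                c = best[i - 1][j] + deletion_cost
                if c < best[i][j]:
                    best[i][j], back[i][j] = c, "skip_x"
            if j > 0:
                c = best[i][j - 1] + deletion_cost
                if c < best[i][j]:
                    best[i][j], back[i][j] = c, "skip_y"
    # Trace back from the final cell to recover the connections.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "connect":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move == "skip_x":
            pairs.append((i - 1, None))
            i -= 1
        else:
            pairs.append((None, j - 1))
            j -= 1
    pairs.reverse()
    return best[n][m], pairs
```

For example, aligning the anchors A, B, C against A, C with a cost of 0 for identical anchors and 10 otherwise connects A with A and C with C and deletes B, for a total cost of 5.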

12.5 Discarding Sections 

The alignment of major anchors partitions the corpora 
into a sequence of aligned sections. Next, (Step 4703), each 
section is accepted or rejected according to the population of 
minor anchors that it contains. Specifically, a section is 
accepted provided that, within the section, both corpora 
contain the same number of minor anchors in the same order. 
Otherwise, the section is rejected. Altogether, using this
criterion, about 10% of each corpus was rejected. The minor
anchors serve to divide the remaining sections into subsec- 
tions that range in size from one sentence to several thou- 
sand sentences and average about ten sentences. 
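The acceptance test can be sketched as follows. The mapping of the four minor anchors to two language-independent labels is an assumption made so that the English and French sequences can be compared directly; the text states only that the number and order of minor anchors must agree.

```python
# Hypothetical language-independent labels for the four minor anchors.
MINOR_ANCHORS = {
    "Author=Mr. Speaker": "speaker",
    "Author=M. le President": "speaker",
    "Author=Some Hon. Members": "members",
    "Author=Des Voix": "members",
}


def minor_anchor_sequence(items):
    """Labels of the minor anchors in a section, in order of appearance."""
    return [MINOR_ANCHORS[item] for item in items if item in MINOR_ANCHORS]


def accept_section(english_items, french_items):
    """Accept a section only if both sides hold the same minor anchors in the same order."""
    return minor_anchor_sequence(english_items) == minor_anchor_sequence(french_items)
```

Items that are not minor anchors, such as ordinary sentences or Paragraph comments, are ignored by the comparison.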

12.6 Aligning Sentences 




The sentences within a subsection are aligned (Step 4704)
using a simple statistical model for sentence lengths and
paragraph markers. Each corpus is viewed as a sequence of
sentence lengths punctuated by occasional paragraph markers,
as illustrated in FIG. 49. In this figure, the circles around
groups of sentence lengths indicate an alignment between
the corresponding sentences. Each group is called a bead.
The example consists of an ef-bead followed by an eff-bead
followed by an e-bead followed by a ¶e¶f-bead. From this
perspective, an alignment is simply a sequence of beads that
accounts for the observed sequences of sentence lengths and
paragraph markers. The model assumes that the lengths of
sentences have been generated by a pair of random processes,
the first producing a sequence of beads and the
second choosing the lengths of the sentences in each bead.

The length of a sentence can be expressed in terms of the
number of tokens in the sentence, the number of characters
in the sentence, or any other reasonable measure. In the
Hansard example, lengths were measured as number of
tokens.

The generation of beads is modelled by the two-state
Markov model shown in FIG. 50. The allowed beads are
shown in FIG. 51. A single sentence in one corpus is
assumed to line up with zero, one, or two sentences in the
other corpus. The probabilities of the different cases are
assumed to satisfy Pr(e)=Pr(f), Pr(eff)=Pr(eef), and
Pr(¶e)=Pr(¶f).

The generation of sentence lengths given beads is modeled
as follows. The probability of an English sentence of
length $l_e$ given an e-bead is assumed to be the same as the
probability of an English sentence of length $l_e$ in the text as
a whole. This probability is denoted by $\Pr(l_e)$. Similarly, the
probability of a French sentence of length $l_f$ given an f-bead
is assumed to equal $\Pr(l_f)$. For an ef-bead, the probability of
an English sentence of length $l_e$ is assumed to equal $\Pr(l_e)$,
and the log of the ratio of the length of the French sentence to
the length of the English sentence is assumed to be normally
distributed with mean $\mu$ and variance $\sigma^2$. Thus, if
$r = \log(l_f / l_e)$, then

$$\Pr(l_f \mid l_e) = \alpha \exp\left[-\frac{(r - \mu)^2}{2\sigma^2}\right], \qquad (181)$$

with $\alpha$ chosen so that the sum of $\Pr(l_f \mid l_e)$ over positive values
of $l_f$ is equal to unity. For an eef-bead, the English sentence
lengths are assumed to be independent with equal marginals
$\Pr(l_e)$, and the log of the ratio of the length of the French
sentence to the sum of the lengths of the English sentences
is assumed to be normally distributed with the same mean
and variance as for an ef-bead. Finally, for an eff-bead, the
probability of an English length $l_e$ is assumed to equal $\Pr(l_e)$,
and the log of the ratio of the sum of the lengths of the
French sentences to the length of the English sentence is
assumed to be normally distributed as before. Then, given
the sum of the lengths of the French sentences, the probability
of a particular pair of lengths, $l_{f_1}$ and $l_{f_2}$, is assumed
to be proportional to $\Pr(l_{f_1}) \Pr(l_{f_2})$.
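Equation (181) can be evaluated numerically as in the following sketch. The values of the mean and variance are those reported for the Hansard example in Table 19. Truncating the normalizing sum at max_len is an assumption made for illustration; the function name is likewise hypothetical.

```python
import math

MU = 0.072       # mean of log(l_f / l_e), from Table 19
SIGMA2 = 0.043   # variance of log(l_f / l_e), from Table 19


def french_length_distribution(l_e, max_len=500):
    """Return {l_f: Pr(l_f | l_e)} for l_f = 1..max_len, following Equation (181).

    The normalizing constant alpha is chosen so that the probabilities
    sum to one over the truncated range (an assumption of this sketch).
    """
    weights = {}
    for l_f in range(1, max_len + 1):
        r = math.log(l_f / l_e)
        weights[l_f] = math.exp(-((r - MU) ** 2) / (2 * SIGMA2))
    alpha = 1.0 / sum(weights.values())
    return {l_f: alpha * w for l_f, w in weights.items()}
```

For an English sentence of 24 tokens, the most probable French length under this sketch is 26 tokens, the integer whose log ratio to 24 lies closest to the mean.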

Together, the model for sequences of beads and the model
for sentence lengths given beads define a hidden Markov
model for the generation of aligned pairs of sentence
lengths. Markov models are described by L. Baum in the
article "An Inequality and associated maximization technique
in statistical estimation of probabilistic functions of a
Markov process", appearing in Inequalities in 1972.

The distributions $\Pr(l_e)$ and $\Pr(l_f)$ are determined from the
relative frequencies of various sentence lengths in the data.
For reasonably small lengths, the relative frequency is a
reliable estimate of the corresponding probability. For
longer lengths, probabilities are determined by fitting the
observed frequencies of longer sentences to the tail of a
Poisson distribution. The values of the other parameters of
the Markov model can be determined from a large
sample of text using the EM method. This method is
described in the above-referenced article by L. Baum.

TABLE 19

Parameter estimates

Parameter             Estimate

Pr(e), Pr(f)          .007
Pr(ef)                .690
Pr(eef), Pr(eff)      .020
Pr(¶e), Pr(¶f)        .005
Pr(¶e¶f)              .245
μ                     .072
σ²                    .043

For the Hansard example, histograms of the sentence
length distributions $\Pr(l_e)$ and $\Pr(l_f)$ for lengths up to 81
are shown in FIGS. 52 and 53 respectively. Except for
lengths 2 and 4, which include a large number of formulaic
sentences in both the French and the English, the distributions
are very smooth.

The parameter values for the Hansard example are shown
in Table 19. From these values it follows that 91% of the
English sentences and 98% of the English paragraph markers
line up one-to-one with their French counterparts. If X is
a random variable whose log is normally distributed with
mean $\mu$ and variance $\sigma^2$, then the mean of X is $\exp(\mu + \sigma^2/2)$.
Thus, from the values in the table, it also follows that the
total length of the French text in an ef-, eef-, or eff-bead is
about 9.8% greater on average than the total length of the
corresponding English text. Since most sentences belong to
ef-beads, this is close to the value of 9.5% given above for
the amount by which the length of the average French
sentence exceeds that of the average English sentence.
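The three figures quoted above can be checked with a short calculation. The sketch below assumes that the percentages are obtained by counting English sentences and English paragraph markers per bead type (one per e-, ef-, and eff-bead, two per eef-bead); the text does not spell out this derivation, and the dictionary keys are illustrative names.

```python
import math

# Bead probabilities and length parameters for the Hansard example.
BEADS = {
    "e": 0.007, "f": 0.007, "ef": 0.690, "eef": 0.020, "eff": 0.020,
    "para_e": 0.005, "para_f": 0.005, "para_ef": 0.245,
}
MU, SIGMA2 = 0.072, 0.043

# Expected number of English sentences per bead; an eef-bead holds two.
english_sentences = BEADS["e"] + BEADS["ef"] + 2 * BEADS["eef"] + BEADS["eff"]
sentences_one_to_one = BEADS["ef"] / english_sentences

# English paragraph markers occur in para_e and para_ef beads.
english_paragraphs = BEADS["para_e"] + BEADS["para_ef"]
paragraphs_one_to_one = BEADS["para_ef"] / english_paragraphs

# Mean of a variable whose log is normal with mean MU and variance SIGMA2.
french_excess = math.exp(MU + SIGMA2 / 2) - 1

print(f"{sentences_one_to_one:.1%} {paragraphs_one_to_one:.1%} {french_excess:.1%}")
```

Under these assumptions the three quantities round to 91%, 98%, and 9.8%, in agreement with the values stated in the text.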

12.7 Ignoring Anchors 

For the Hansard example, the distribution of English 
sentence lengths shown in FIG. 52 can be combined with the 
conditional distribution of French sentence lengths given 
English sentence lengths from Equation (181) to obtain the 
joint distribution of French and English sentence lengths in 
ef-, eef-, and eff-beads. For this joint distribution, the mutual 
information between French and English sentence length is 
1.85 bits per sentence. It follows that even in the absence of 
the anchor points, the correlation in sentence lengths is 
strong enough to allow alignment with an error rate that is 
asymptotically less than 100%. 

Numerical estimates for the error rate as a function of the
frequency of anchor points can be obtained by Monte Carlo
simulation. The empirical distributions $\Pr(l_e)$ and $\Pr(l_f)$
shown in FIGS. 52 and 53, and the parameter values from
Table 19 can be used to generate an artificial pair of aligned
corpora, and then the most probable alignment for these
corpora can be found. The error rate can be estimated as the
fraction of ef-beads in the most probable alignment that did
not correspond to ef-beads in the true alignment.
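The generation half of such a simulation might be sketched as follows. Two simplifications are assumed and are not taken from the text: beads are drawn independently from the Table 19 probabilities rather than from the two-state Markov model of FIG. 50, and a French length is drawn by rounding a log-normal multiple of the English length rather than from the discrete distribution of Equation (181). Finding the most probable alignment of the generated corpora is not shown.

```python
import math
import random

BEAD_PROBS = {
    "e": 0.007, "f": 0.007, "ef": 0.690, "eef": 0.020, "eff": 0.020,
    "para_e": 0.005, "para_f": 0.005, "para_ef": 0.245,
}
MU, SIGMA2 = 0.072, 0.043


def sample_beads(n, rng):
    """Draw n bead types independently (assumption: no Markov dependence)."""
    names = list(BEAD_PROBS)
    weights = [BEAD_PROBS[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def sample_french_length(l_e, rng):
    """Draw a French length for an English length l_e (rounded log-normal approximation)."""
    ratio = math.exp(rng.gauss(MU, math.sqrt(SIGMA2)))
    return max(1, round(l_e * ratio))


rng = random.Random(0)          # fixed seed so a run can be repeated
true_beads = sample_beads(10000, rng)
```

The list true_beads plays the role of the true alignment against which a recovered alignment would be scored.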

By repeating this process many thousands of times, an 
expected error rate of about 0.9% was estimated for the 
actual frequency of anchor points in the Hansard data. By 
varying the parameters of the hidden Markov model, the 
effect of anchor points and paragraph markers on error rate 
can be explored. With paragraph markers but no anchor 
points, the expected error rate is 2.0%; with anchor points but
no paragraph markers, the expected error rate is 2.3%; and
with neither anchor points nor paragraph markers, the
expected error rate is 3.2%. Thus, while anchor points and
paragraph markers are important, alignment is still feasible
without them. This is promising since it suggests that the
method is applicable to corpora for which frequent anchor
points are not available.

TABLE 20

Unusual but correct alignments

English: And love and kisses to you, too.
French:  Pareillement.

English: . . . mugwumps who sit on the fence with their mugs on
         one side and their wumps on the other side and do not
         know which side to come down on.
French:  . . . en voulant ménager la chèvre et le choux ils
         n'arrivent pas à prendre parti.

English: At first reading, she may have.
French:  Elle semble en effet avoir un grief tout à fait valable,
         du moins au premier abord.



12.8 Results for the Hansard Example 

For the Hansard example, the alignment method
described above ran for 10 days on an IBM Model 3090
mainframe under an operating system that permitted access
to 16 megabytes of virtual memory. The most probable
alignment contained 2,869,041 ef-beads. In a random
sample of 1000 of the aligned sentence pairs, 6 errors were
found. This is consistent with the expected error rate of 0.9%
mentioned above. In some cases, the method correctly
aligned sentences with very different lengths. Examples are
shown in Table 20.

13 ALIGNING BILINGUAL CORPORA 

With the growing availability of machine-readable bilingual
texts has come a burgeoning interest in methods for
extracting linguistically valuable information from such
texts. One way of obtaining such information is to construct
sentence and word correspondences between the texts in the
two languages of such corpora.

A method for doing this is depicted schematically in FIG.
45. This method comprises the steps of:

4501. Beginning with a large bilingual corpus;

4504. Extracting pairs of sentences 4502 from this corpus
such that each pair consists of a source and target sentence
which are translations of each other;

4505. Within each sentence pair, aligning the words of the
target sentence with the words in a source sentence, to obtain
a bilingual corpus labelled with word-by-word
correspondences 4503.

In one embodiment of Step 4504, pairs of aligned sentences
are extracted using the method explained in detail in
Section 12. The method proceeds without inspecting the
identities of the words within sentences, but rather uses only
the number of words or number of characters that each
sentence contains.

In one embodiment of Step 4505, word-by-word correspondences
within a sentence pair are determined by finding
the Viterbi alignment or approximate Viterbi alignment for
the pair of sentences using a translation model of the sort
discussed in Sections 8-10 above. These models constitute
a mathematical embodiment of the powerfully compelling
intuitive feeling that a word in one language can be translated
into a word or phrase in another language.

Word-by-word alignments obtained in this way offer a
valuable resource for work in bilingual lexicography and
machine translation. For example, a method of cross-lingual
sense labeling, described in Section 11, and also in the
aforementioned paper, "Word Sense Disambiguation using
Statistical Methods", uses alignments obtained in this way
as data for construction of a statistical sense-labelling module.



14 Hypothesis Search - Steps 702 and 902 

14.1 Overview of Hypothesis Search

Referring now to FIG. 7, the second step 702 produces a 
set of hypothesized target structures which correspond to 
putative translations of the input intermediate source struc- 
ture produced by step 701. The process by which these target 
structures are produced is referred to as hypothesis search. 
In a preferred embodiment target structures correspond to 
sequences of morphemes. In other embodiments more 
sophisticated linguistic structures such as parse trees or case 
frames may be hypothesized. 

An hypothesis in this step 702 is comprised of a target 
structure and an alignment of a target structure with the input 
source structure. Associated with each hypothesis is a score. 
In other embodiments a hypothesis may have multiple 
alignments. In embodiments in which step 701 produces 
multiple source structures an hypothesis may contain mul- 
tiple alignments for each source structure. It will be assumed 
here that the target structure comprised by a hypothesis 
contains a single instance of the null target morpheme. The 
null morphemes will not be shown in the Figures pertaining 
to hypothesis search, but should be understood to be part of 
the target structures nonetheless. Throughout this section on 
hypothesis search, partial hypothesis will be used inter- 
changeably with hypothesis, partial alignment with align- 
ment, and partial target structure with target structure. 

The target structures generated in this step 702 are produced
incrementally. The process by which this is done is
depicted in FIG. 54. This process is comprised of five steps.

A set of partial hypotheses is initialized in step 5401. A 
partial hypothesis is comprised of a target structure and an 
alignment with some subset of the morphemes in the source 
structure to be translated. The initial set generated by step 
5401 consists of a single partial hypothesis. The partial 
target structure for this partial hypothesis is just an empty 
sequence of morphemes. The alignment is the empty align- 
ment in which no morphemes in the source structure to be 
translated are accounted for. 

The system then enters a loop through steps 5402, 5403, 
and 5404, in which partial hypotheses are iteratively 
extended until a test for completion is satisfied in step 5403. 
At the beginning of this loop, in step 5402, the existing set 
of partial hypotheses is examined and a subset of these 
hypotheses is selected to be extended in the steps which 
comprise the remainder of the loop. In step 5402 the score 
for each partial hypothesis is compared to a threshold (the 
method used to compute these thresholds is described 
below). Those partial hypotheses with scores greater than
the threshold are then placed on a list of partial hypotheses to be
extended in step 5404. Each partial hypothesis that is






extended in step 5404 contains an alignment which accounts 
for a subset of the morphemes in the source sentence. The 
remainder of the morphemes must still be accounted for. 
Each extension of an hypothesis in step 5404 accounts for 
one additional morpheme. Typically, there are many tens or 
hundreds of extensions considered for each partial hypoth- 
esis to be extended. For each extension a new score is 
computed. This score contains a contribution from the 
language model as well as a contribution from the transla- 
tion model. The language model score is a measure of the 
plausibility a priori of the target structure associated with the 
extension. The translation model score is a measure of the 
plausibility of the partial alignment associated with the 
extension. A partial hypothesis is considered to be a full 
hypothesis when it accounts for the entire source structure to 
be translated. A full hypothesis contains an alignment in 
which every morpheme in the source structure is aligned 
with a morpheme in the hypothesized target structure. The 
iterative process of extending partial hypotheses terminates 
when step 5402 produces an empty list of hypotheses to be 
extended. A test for this situation is made in step 5403.
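The loop through steps 5401 to 5404 can be sketched in Python as follows. This is an illustration only: the Hypothesis class, the extend and threshold callbacks, and the toy lexicon are hypothetical, and the sketch assumes that partial hypotheses falling below the threshold are simply dropped and that full hypotheses are collected rather than extended further, points on which the text above is silent.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Sequence, Tuple


@dataclass(frozen=True)
class Hypothesis:
    target: Tuple[str, ...]     # partial target structure
    covered: FrozenSet[int]     # source positions accounted for so far
    score: float = 1.0


def hypothesis_search(
    source: Sequence[str],
    extend: Callable[[Hypothesis, Sequence[str]], List[Hypothesis]],
    threshold: Callable[[Hypothesis], float],
) -> List[Hypothesis]:
    hypotheses = [Hypothesis((), frozenset())]            # step 5401: empty hypothesis
    full: List[Hypothesis] = []
    while True:
        to_extend = [h for h in hypotheses                # step 5402: select by score
                     if len(h.covered) < len(source) and h.score > threshold(h)]
        if not to_extend:                                 # step 5403: completion test
            return full
        hypotheses = [new for h in to_extend              # step 5404: extend by one morpheme
                      for new in extend(h, source)]
        full.extend(h for h in hypotheses if len(h.covered) == len(source))


def toy_extend(h: Hypothesis, source: Sequence[str]) -> List[Hypothesis]:
    """Toy extension: account for the leftmost uncovered source morpheme."""
    lexicon = {"la": ["the"], "fille": ["girl", "daughter"]}
    j = min(set(range(len(source))) - h.covered)
    return [Hypothesis(h.target + (e,), h.covered | {j}, h.score * 0.5)
            for e in lexicon[source[j]]]


results = hypothesis_search(["la", "fille"], toy_extend, lambda h: 0.0)
```

In the toy run, the search ends after two rounds of extension with two full hypotheses, one ending in girl and one ending in daughter.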

This method for generating target structure hypotheses
can be extended to an embodiment of step 902 of FIG. 9 by
replacing the hypothesis extension step 5404 in FIG. 54
with a very similar step that only considers extensions which
are consistent with the set of constraints 906. Such a
modification is a simple matter for one skilled in the art.

14.2 Hypothesis Extension 5404 

This section provides a description of the method by 
which hypotheses are extended in step 5404 of FIG. 54. 
Examples will be taken from an embodiment in which the 
source language is French and the target language is English. 
It should be understood however that the method described 
applies to other language pairs. 

14.2.1 Types of Hypothesis Extension 

There are a number of different ways a partial hypothesis
may be extended. Each type of extension is scored by the
method described in this section, which assigns scores to
partial hypotheses based on Model 3 from the section entitled
Translation Models and Parameter Estimation. One skilled
in the art will be able to adapt the method described here to
other models, such as Model 5, which is used in the best
mode of operation.

A partial hypothesis is extended by accounting for one
additional previously unaccounted for element of the source
structure. When a partial hypothesis $H_1$ is extended to some
other hypothesis $H_2$, the score assigned to $H_2$ is a product of
the score associated with $H_1$ and a quantity denoted as the
extension score. The value of the extension score is determined
by the language model, the translation model, the
hypothesis being extended and the particular extension that
is made. A number of different types of extensions are
possible and are scored differently. The possible extension
types and the manner in which they are scored are illustrated
in the examples below.

As depicted in FIG. 55, in a preferred embodiment, the
French sentence La jeune fille a réveillé sa mère is transduced
in either step 701 of FIG. 7 or step 901 of FIG. 9 into
the morphological sequence la jeune fille V_past_3s réveiller
sa mère.

The initial hypothesis accounts for no French morphemes
and the score of this hypothesis is set to 1. This hypothesis
can be extended in a number of ways. Two sample extensions
are shown in FIGS. 56 and 57. In the first example in
FIG. 56, the English morpheme the is hypothesized as
accounting for the French morpheme la. The component of
the score associated with this extension is equal to

$$l(\text{the} \mid *, *)\; n(1 \mid \text{the})\; t(\text{la} \mid \text{the})\; d(1 \mid 1). \qquad (182)$$



Here, * is a symbol which denotes a sequence boundary,
and the factor $l(\text{the} \mid *, *)$ is the trigram language model
parameter that serves as an estimate of the probability that
the English morpheme the occurs at the beginning of a
sentence. The factor $n(1 \mid \text{the})$ is the translation model
parameter that is an estimate of the probability that the English
morpheme the has fertility 1, in other words, that the English
morpheme the is aligned with only a single French morpheme.
The factor $t(\text{la} \mid \text{the})$ is the translation model
parameter that serves as an estimate of the lexical probability that
the English morpheme the translates to the French morpheme
la. Finally, the factor $d(1 \mid 1)$ is the translation model
parameter that serves as an estimate of the distortion
probability that a French morpheme will be placed in position 1
of the French structure given that it is aligned with an
English morpheme that is in position 1 of the English
structure. In the second example in FIG. 57, the English
morpheme mother is hypothesized as accounting for the
French morpheme mère. The score for this partial hypothesis is

$$l(\text{mother} \mid *, *)\; n(1 \mid \text{mother})\; t(\text{mère} \mid \text{mother})\; d(7 \mid 1). \qquad (183)$$



Here, the final $d(7 \mid 1)$ serves as an estimate of the distortion
probability that a French morpheme, such as mère, will
be placed in the 7th position in a source sequence given that
it is aligned with an English morpheme such as mother
which is in the 1st position in an hypothesized target
sequence.
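The score of a closed extension such as those in Equations 182 and 183 is a product of four table lookups, as the following sketch shows. All parameter values below are invented for illustration and are not taken from any trained model; the table layout and function name are likewise assumptions.

```python
# Hypothetical parameter tables; real values would come from trained models.
language_model = {("the", ("*", "*")): 0.2}   # l(e | two preceding morphemes)
fertility = {(1, "the"): 0.7}                 # n(phi | e)
lexical = {("la", "the"): 0.3}                # t(f | e)
distortion = {(1, 1): 0.5}                    # d(j | i)


def closed_extension_score(e, context, f, j, i):
    """Score of appending closed target morpheme e, aligned with source morpheme f.

    j is the position of f in the source structure and i is the position
    of e in the target structure, as in Equations 182 and 183.
    """
    return (language_model[(e, context)]
            * fertility[(1, e)]
            * lexical[(f, e)]
            * distortion[(j, i)])


score = closed_extension_score("the", ("*", "*"), "la", 1, 1)
```

Because the initial hypothesis has a score of 1, this extension score is also the score of the resulting partial hypothesis in the first example.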

Now, suppose the partial hypothesis in FIG. 56 is to be
extended on some other invocation of step 5404. A common
translation of the pair of French morphemes jeune fille is the
English morpheme girl. However, since in a preferred
embodiment a partial hypothesis is extended to account for
only a single French morpheme at a time, it is not possible
to account for both jeune and fille with a single extension.
Rather the system first accounts for one of the morphemes,
and then on another round of extensions, accounts for the
other. This can be done in two ways, either by accounting first
for jeune or by accounting first for fille. FIG. 58 depicts the
extension that accounts first for fille. The + symbol in FIG.
58 after the English morpheme girl denotes the fact that
in these extensions girl is to be aligned with more French
morphemes than it is currently aligned with, in this case, at
least two. A morpheme so marked is referred to as open. A
morpheme that is not open is said to be closed. A partial
hypothesis which contains an open target morpheme is
referred to as open, or as an open partial hypothesis. A partial
hypothesis which is not open is referred to as closed, or as
a closed partial hypothesis. An extension is referred to as
either open or closed according to whether or not the
resultant partial hypothesis is open or closed. In a preferred
embodiment, only the last morpheme in a partial hypothesis
can be designated open. The score for the extension in FIG.
58 is



$$l(\text{girl} \mid *, \text{the})\; 2 \left( \sum_{i=2}^{25} n(i \mid \text{girl}) \right) t(\text{fille} \mid \text{girl})\; d(3 \mid 2). \qquad (184)$$

Here, the factor $l(\text{girl} \mid *, \text{the})$ is the language model parameter
that serves as an estimate of the probability with which
the English morpheme girl is the second morpheme in a
target structure in which the first morpheme is the. The next
factor of 2 is the combinatorial factor that is discussed in the
section entitled Translation Models and Parameter Estimation.
It is factored in, in this case, because the open English
morpheme girl is to be aligned with at least two French
morphemes. The factor $n(i \mid \text{girl})$ is the translation model
parameter that serves as an estimate of the probability that
the English morpheme girl will be aligned with exactly i
French morphemes, and the sum of these parameters for i
between 2 and 25 is an estimate of the probability that girl
will be aligned with at least 2 morphemes. It is assumed that
the probability that an English morpheme will be aligned
with more than 25 French morphemes is 0. Note that in a
preferred embodiment of the present invention, this sum can
be prepared and stored in memory as a separate parameter.
The factor $t(\text{fille} \mid \text{girl})$ is the translation model parameter that
serves as an estimate of the lexical probability that one of the
French morphemes aligned with the English morpheme girl
will be the French morpheme fille. Finally, the factor $d(3 \mid 2)$
is the translation model parameter that serves as an estimate
of the distortion probability that a French morpheme will be
placed in position 3 of the French structure given that it is
aligned with an English morpheme which is in position 2 of
the English structure. This extension score in Equation 184
is multiplied by the score in Equation 182 for the partial
hypothesis which is being extended to yield a new score for
the partial hypothesis in FIG. 58 of

$$l(\text{the} \mid *, *)\; n(1 \mid \text{the})\; t(\text{la} \mid \text{the})\; d(1 \mid 1) \times l(\text{girl} \mid *, \text{the})\; 2 \left( \sum_{i=2}^{25} n(i \mid \text{girl}) \right) t(\text{fille} \mid \text{girl})\; d(3 \mid 2). \qquad (185)$$

Consider now an extension to the partial hypothesis in
FIG. 58. If a partial hypothesis that is to be extended
contains an open morpheme, then, in a preferred embodiment,
that hypothesis can only be extended by aligning
another morpheme from the source structure with that open
morpheme. When such an extension is made, there are two
possibilities: 1) the open morpheme is kept open in the
extended partial hypothesis, indicating that more source
morphemes are to be aligned with that open target morpheme,
or 2) the open morpheme is closed, indicating that no
more source morphemes are to be aligned with that target
morpheme. These two cases are illustrated in FIGS. 59 and
60.

In FIG. 59, an extension is made of the partial alignment
in FIG. 58 by aligning the additional French morpheme
jeune with the English morpheme girl. In this example the
English morpheme girl is then closed in the resultant partial
hypothesis. The extension score for this example is

$$\left( \frac{n(2 \mid \text{girl})}{\sum_{i=2}^{25} n(i \mid \text{girl})} \right) t(\text{jeune} \mid \text{girl})\; d(2 \mid 2). \qquad (186)$$

Here, the first quotient adjusts the fertility score for the
partial hypothesis by dividing out the estimate of the probability
that girl is aligned with at least two French morphemes
and by multiplying in an estimate of the probability
that girl is aligned with exactly two French morphemes. As
in the other examples, the second and third factors are
estimates of the lexical and distortion probabilities associated
with this extension.

In FIG. 60, the same extension is made as in FIG. 59,
except here the English morpheme girl is kept open after the
extension, hence the + sign. The extension score for this
example is

$$3 \left( \frac{\sum_{i=3}^{25} n(i \mid \text{girl})}{\sum_{i=2}^{25} n(i \mid \text{girl})} \right) t(\text{jeune} \mid \text{girl})\; d(2 \mid 2). \qquad (187)$$



Here, the factor of 3 is the adjustment to the combinatorial
factor for the partial hypothesis. Since the score for the
partial hypothesis in FIG. 59 already has a combinatorial
factor of 2, the score for the resultant partial hypothesis in
FIG. 60 will have a combinatorial factor of 2×3=3!. The
quotient adjusts the fertility score for the partial hypothesis to
reflect the fact that in further extensions of this hypothesis
girl will be aligned with at least three French morphemes.
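The fertility bookkeeping for open morphemes can be sketched as follows. The function names and the example fertility table are hypothetical; the sketch covers only the fertility and combinatorial factors of Equations 184, 186, and 187, leaving out the language model, lexical, and distortion factors.

```python
MAX_FERTILITY = 25   # fertilities above 25 are assumed to have probability 0


def at_least(n, k):
    """Estimate that a morpheme with fertility table n has fertility >= k."""
    return sum(n.get(i, 0.0) for i in range(k, MAX_FERTILITY + 1))


def open_factor(n):
    """Factor when a morpheme is first aligned and left open (Equation 184)."""
    return 2 * at_least(n, 2)


def close_factor(n, k):
    """Factor when the k-th source morpheme is aligned and the morpheme is closed.

    With k = 2 this is the quotient in Equation 186.
    """
    return n.get(k, 0.0) / at_least(n, k)


def keep_open_factor(n, k):
    """Factor when the k-th source morpheme is aligned and the morpheme stays open.

    With k = 2 this is the factor of 3 times the quotient in Equation 187.
    """
    return (k + 1) * at_least(n, k + 1) / at_least(n, k)


# Invented fertility table n(i | girl) for illustration.
n_girl = {1: 0.6, 2: 0.3, 3: 0.1}
closed_at_two = open_factor(n_girl) * close_factor(n_girl, 2)
closed_at_three = (open_factor(n_girl)
                   * keep_open_factor(n_girl, 2)
                   * close_factor(n_girl, 3))
```

The intermediate sums cancel, so a morpheme that is eventually closed with fertility k contributes k! times n(k | girl) in total: 2 × 0.3 when closed at two and 6 × 0.1 when closed at three in this example.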

Another type of extension performed in the hypothesis 
search is one in which two additional target morphemes are 
appended to the partial target structure of the partial hypoth- 
esis being extended. In this type of extension, the first of 
these two morphemes is assigned a fertility of zero and the 
second is aligned with a single morpheme from the source 
structure. This second target morpheme may be either open 
or closed. 

FIG. 62 shows an extension of the partial hypothesis in 
FIG. 61 by the two target morphemes up her, in which her 
is aligned with the source morpheme sa. The score for this 
extension is 

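$$l(\text{up} \mid \text{girl}, \text{to\_wake})\; l(\text{her} \mid \text{to\_wake}, \text{up})\; n(0 \mid \text{up})\; n(1 \mid \text{her})\; t(\text{sa} \mid \text{her})\; d(6 \mid 6). \qquad (188)$$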

Here, the first two factors are the trigram language model 
estimates of the probabilities with which up follows girl 
to_wake, and with which her follows to_wake up, respec- 
tively. The third factor is the fertility parameter that serves 
as an estimate of the probability that up is aligned with no 
source morphemes. The fourth, fifth, and sixth factors are 
the appropriate fertility, lexical, and distortion parameters 
associated with the target morpheme her in this partial
alignment.

FIG. 63 shows a similar extension by up her. The difference
with the extension in FIG. 62 is that in FIG. 63, the
target morpheme her is open. The score for this extension
is



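$$l(\text{up} \mid \text{girl}, \text{to\_wake})\; l(\text{her} \mid \text{to\_wake}, \text{up})\; n(0 \mid \text{up}) \times 2 \left( \sum_{i=2}^{25} n(i \mid \text{her}) \right) t(\text{sa} \mid \text{her})\; d(6 \mid 6). \qquad (189)$$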



The score for this extension differs from the score in 
Equation 188 in that the fertility parameter n(1|her) is
replaced by the combinatorial factor 2 and the sum of 
fertility parameters which provides an estimate of the prob- 
ability that her will be aligned with at least two source 
morphemes. 
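The difference between the closed extension of Equation 188 and the open extension of Equation 189 can be illustrated with a short Python sketch. All parameter values below are made-up numbers chosen for illustration, not values from a trained model, and the function names are hypothetical.

```python
MAX_FERTILITY = 25  # fertilities above 25 are assumed to have probability 0


def open_fertility(n, word, k):
    """Estimate of the probability that `word` is aligned with at least k morphemes."""
    return sum(n.get((i, word), 0.0) for i in range(k, MAX_FERTILITY + 1))


def closed_extension_score(lm_up, lm_her, n, t_lex, d_dist):
    # Pattern of Equation 188: "her" is closed with fertility exactly 1.
    return lm_up * lm_her * n[(0, "up")] * n[(1, "her")] * t_lex * d_dist


def open_extension_score(lm_up, lm_her, n, t_lex, d_dist):
    # Pattern of Equation 189: n(1|her) is replaced by the combinatorial
    # factor 2 times the sum of n(i|her) for i from 2 to 25.
    return (lm_up * lm_her * n[(0, "up")]
            * 2 * open_fertility(n, "her", 2) * t_lex * d_dist)


# Hypothetical fertility table: n[(i, word)] stands for n(i | word).
n = {(0, "up"): 0.5, (1, "her"): 0.6, (2, "her"): 0.3, (3, "her"): 0.1}

closed = closed_extension_score(0.2, 0.1, n, t_lex=0.4, d_dist=0.5)
opened = open_extension_score(0.2, 0.1, n, t_lex=0.4, d_dist=0.5)
```

The two scores differ only in the fertility term, so their ratio depends on the fertility table alone.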

A remaining type of extension is where a partial hypoth- 
esis is extended by an additional connection which aligns a 
source morpheme with the null target morpheme. The score 
for this type of extension is similar to those described above. 
No language model score is factored in, and scores from the 
translation model are factored in, in accordance with the 




probabilities associated with the null word as described in 
the section entitled Translation Models and Parameter Esti- 
mation. 

14.3 Selection of Hypothesis to Extend 5402 

Throughout the hypothesis search process, partial hypotheses
are maintained in a set of priority queues. In theory,
there is a single priority queue for each subset of positions
in the source structure. So, for example, for the source
structure oui, oui, there are three positions: oui is in position 1; a
comma is in position 2; and oui is in position 3, and there are
therefore 2^3 = 8 subsets of positions: [], [1], [2], [3], [1,2], [1,3],
[2,3], and [1,2,3]. In practice these priority queues are
initialized only on demand, and many fewer than the full
number of possible queues are used in the hypothesis search.
In a preferred embodiment, each partial hypothesis is comprised
of a sequence of target morphemes, and these morphemes
are aligned with a subset of source morphemes.
Corresponding to that subset of source morphemes is a
priority queue in which the partial hypothesis is stored. The
partial hypotheses within a queue are prioritized according
to the scores associated with those hypotheses. In certain
preferred embodiments the priority queues are limited in
size and only the 1000 hypotheses with the best scores are
maintained.
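One possible way to organize such queues is sketched below in Python. This is only an illustration under stated assumptions: the class name HypothesisQueues, its methods, and the use of a min-heap per subset are hypothetical choices, not the data structure of the described embodiment. Queues are created on demand and each is capped at a fixed size.

```python
import heapq
from collections import defaultdict

MAX_QUEUE_SIZE = 1000  # cap mentioned for certain preferred embodiments


class HypothesisQueues:
    """Priority queues keyed by the set of covered source positions."""

    def __init__(self, max_size=MAX_QUEUE_SIZE):
        self.max_size = max_size
        # A queue is created only when a hypothesis is first stored in it.
        self.queues = defaultdict(list)  # frozenset -> min-heap of entries
        self._counter = 0  # tie-breaker so hypotheses are never compared

    def add(self, covered_positions, score, hypothesis):
        heap = self.queues[frozenset(covered_positions)]
        self._counter += 1
        entry = (score, self._counter, hypothesis)
        if len(heap) < self.max_size:
            heapq.heappush(heap, entry)
        elif score > heap[0][0]:
            # Queue is full: drop the worst-scoring hypothesis.
            heapq.heapreplace(heap, entry)

    def best(self, covered_positions):
        heap = self.queues.get(frozenset(covered_positions))
        if not heap:
            return None
        score, _, hypothesis = max(heap)
        return score, hypothesis
```

Keeping the worst retained hypothesis at the top of a min-heap makes the size cap cheap to enforce, at the cost of a linear scan when the best hypothesis is requested.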

The set of all subsets of a set of source structure positions
can be arranged in a subset lattice. For example, the subset
lattice for the set of all subsets of the set [1,2,3] is shown in FIG.
64. In a subset lattice, a parent of a set S is any set which
contains one less element than S, and which is also a subset
of S. In FIG. 64 arrows have been drawn from each set in the
subset lattice to each of its parents. For example, the set [2]
is a parent of the set [1,2].

A subset lattice defines a natural partial ordering on a set 
of sets. Since the priority queues used in hypothesis search 
are associated with subsets, a subset lattice also defines a 
natural partial ordering on the set of priority queues. Thus in
FIG. 64, there are two parents of the priority queue associated
with the subset of source structure positions [1,3].
These two parents are the priority queues associated with the
sets [1] and [3]. A priority queue Q1 is said to be an ancestor
of another priority queue Q2 if 1) Q1 is not equal to Q2, and 2) Q1
is a subset of Q2. If Q1 is an ancestor of Q2, then Q2 is said
to be a descendant of Q1.
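The parent and ancestor relations of the subset lattice can be expressed directly with set operations. The following Python sketch identifies each priority queue with its associated set of source positions; the function names are illustrative only.

```python
def parents(subset):
    """All sets with one less element than `subset` that are subsets of it."""
    s = frozenset(subset)
    return [s - {x} for x in sorted(s)]


def is_ancestor(q1, q2):
    """True if q1 is not equal to q2 and q1 is a subset of q2."""
    q1, q2 = frozenset(q1), frozenset(q2)
    return q1 != q2 and q1 <= q2


def is_descendant(q2, q1):
    """q2 is a descendant of q1 exactly when q1 is an ancestor of q2."""
    return is_ancestor(q1, q2)
```

For instance, parents({1, 3}) yields the sets {3} and {1}, matching the two parents shown for [1,3] in FIG. 64.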

Considering now the process by which a set of partial 
hypotheses are selected in step 5402 to be extended in step 
5404, when step 5402 is invoked, it is invoked with a list of
partial hypotheses that were either 1) created by the initialization
step 5401, or 2) created as the result of extensions in
step 5404 on a previous pass through the loop comprised of
steps 5402, 5403, and 5404. These partial hypotheses are
stored in priority queues according to the sets of source
morphemes they account for. For example, the partial 
hypothesis in FIG. 65 would be stored in the priority queue 
associated with the set [2,3], since it accounts for the source 
morphemes in positions 2 and 3. 

A priority queue is said to be active if there are partial
hypotheses stored in it. An active priority queue is said to be
on the frontier if it has no active descendant. The cardinality
of a priority queue is equal to the number of elements in the
subset with which it is associated. So, for example, the
cardinality of the priority queue which is associated with the
set [2,3] is 2.
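These three definitions translate into a few lines of Python. The sketch below assumes, for illustration only, that the queues are held in a dictionary mapping each set of source positions to a list of stored hypotheses; the function names are hypothetical.

```python
def active_queues(queues):
    """Sets whose priority queue currently stores at least one hypothesis."""
    return {s for s, hyps in queues.items() if hyps}


def frontier(queues):
    """Active queues that have no active descendant (no active proper superset)."""
    active = active_queues(queues)
    return {s for s in active if not any(s < other for other in active)}


def cardinality(subset):
    """Number of source positions in the subset associated with a queue."""
    return len(subset)


queues = {
    frozenset(): ["h0"],
    frozenset({1}): ["h1"],
    frozenset({1, 2}): [],        # created but currently empty, so not active
    frozenset({2, 3}): ["h2"],
}
```

With these example queues, the empty set is active but not on the frontier, because both [1] and [2,3] are active descendants of it.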

The process in step 5402 functions by assigning a threshold
to every active priority queue and then placing on the list
of partial hypotheses to be extended every partial hypothesis
on an active priority queue that has a score that is greater
than the threshold for that priority queue. This is depicted in
FIG. 66. First, in step 6601 the threshold for every active
priority queue is initialized to infinity (in practice, some very
large number). Second, in step 6602, thresholds are determined
for every priority queue on the frontier.

The method by which these thresholds are computed is 
best described by first describing what the normalizer of a 
priority queue is. Each priority queue on the frontier corresponds
to a set of positions of source morphemes. At each
of these positions is a particular source morpheme.
Associated with each morpheme is a number, which in a 
preferred embodiment is the unigram probability of that 
source morpheme. These unigram probabilities are esti- 
mated by transducing a large body of source text and simply 
counting the frequency with which the different source 
morphemes occur. The normalizer for a priority queue is 
defined to be the product of all the unigram probabilities for 
the morphemes at the positions in the associated set of 
source structure positions. For example, the normalizer for
the priority queue associated with the set [2,3] for the source
structure la jeune fille V_past_3s reveiller sa mere is:

$$\text{normalizer}([2,3]) = \Pr(\text{jeune})\,\Pr(\text{fille}). \qquad (190)$$

For each priority queue Q on the frontier define the 
normed score of Q to be equal to the score of the partial 
hypothesis with the greatest score in Q divided by the 
normalizer for Q. Let Z be equal to the maximum of all 
normed scores for all priority queues on the frontier. The 
threshold assigned to a priority queue Q on the frontier is 
then equal to Z times the normalizer for that priority queue 
divided by a constant which in a preferred embodiment is 
45. 
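The threshold computation for frontier queues can be sketched in Python as follows. The unigram probabilities and best scores below are invented for illustration, positions are numbered from 1 as in the text, and the function names are hypothetical.

```python
from math import prod


def normalizer(subset, source, unigram):
    """Product of the unigram probabilities of the source morphemes in `subset`."""
    return prod(unigram[source[p - 1]] for p in subset)


def frontier_thresholds(best_scores, source, unigram, constant=45.0):
    """best_scores maps each frontier queue (a frozenset) to its best score."""
    norm = {q: normalizer(q, source, unigram) for q in best_scores}
    # Z is the maximum normed score over all queues on the frontier.
    z = max(score / norm[q] for q, score in best_scores.items())
    return {q: z * norm[q] / constant for q in best_scores}


source = ["la", "jeune", "fille", "V_past_3s", "reveiller", "sa", "mere"]
unigram = {"la": 0.5, "jeune": 0.1, "fille": 0.2}           # made-up values
best_scores = {frozenset({2, 3}): 0.004, frozenset({1}): 0.05}
thresholds = frontier_thresholds(best_scores, source, unigram)
```

In this example the queue for [2,3] has the largest normed score, so its threshold is simply its own best score divided by 45, while the threshold for [1] is scaled up by that queue's larger normalizer.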

After thresholds have been assigned to the priority queues
on the frontier in step 6602, a loop is performed in steps
6604 through 6610. The loop counter i is equal to a different
cardinality on each iteration of the loop. The counter i is
initialized in step 6604 to the largest cardinality of any active
priority queue; in other words, i is initialized to the maximum
cardinality of any priority queue on the frontier. On
each iteration of the loop the value of i is decreased by 1
until i is equal to 0, at which point the test 6604 is satisfied
and the process of selecting partial hypotheses to be
extended is terminated.

Inside the loop through cardinalities is another loop in 
steps 6606 through 6609. This is a loop through all active 
priority queues of a given cardinality. In this loop each 
priority queue of cardinality i is processed in step 6608. 

A schematic flow diagram for this processing step 6608 is 
shown in FIG. 67. The priority queue Q to be processed 
enters this step at 6701. Steps 6704 through 6707 perform a 
loop through all partial hypotheses i in the priority queue Q
whose scores are greater than the threshold associated with Q. At
step 6705 the partial hypothesis i is added to the list of partial
hypotheses to be extended. At step 6706 i is used to adjust
the thresholds of all active priority queues which are parents
of Q. These thresholds are then used when priority queues of
lower cardinality are processed in the loop beginning at step
6604 in FIG. 66.

Each priority queue which is a parent of partial hypothesis 
i at step 6706 contains partial hypotheses which account for 
one less source morpheme than the partial hypothesis i does. 
For example, consider the partial hypothesis depicted in 
FIG. 59. Suppose this is the partial hypothesis i. The two 
target morphemes the and girl are aligned with the three 
source morphemes la, jeune, and fille which are in source 
structure positions 1, 2, and 3 respectively. This hypothesis 
i is therefore in the priority queue corresponding to the set 
[1,2,3]. The priority queues that are parents of this hypothesis
correspond to the sets [1,2], [1,3], and [2,3]. We can use






partial hypothesis i to adjust the threshold in each of these 
priority queues, assuming they are all active, by computing 
a parent score, score_p, from the score score_i associated with
the partial hypothesis i. A potentially different parent score 
is computed for each active parent priority queue. That 
parent score is then divided by a constant, which in a 
preferred embodiment is equal to 45. The new threshold for 
that queue is then set to the minimum of the previous 
threshold and that parent score. 
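The threshold update itself is a simple minimum. The Python sketch below assumes that the parent scores have already been computed for each parent queue; the dictionary layout and the function name are hypothetical, and the numeric values are invented.

```python
import math


def adjust_parent_thresholds(thresholds, parent_scores, constant=45.0):
    """Lower the threshold of each active parent queue using hypothesis i.

    thresholds:    dict mapping each active queue (frozenset) to its threshold
    parent_scores: dict mapping each parent queue to the parent score score_p
    """
    for queue, score_p in parent_scores.items():
        if queue in thresholds:  # only active parent queues are adjusted
            thresholds[queue] = min(thresholds[queue], score_p / constant)
    return thresholds


thresholds = {frozenset({1, 2}): math.inf, frozenset({2, 3}): 1e-5}
parent_scores = {
    frozenset({1, 2}): 0.009,
    frozenset({2, 3}): 0.0045,
    frozenset({1, 3}): 0.1,      # this parent queue is not active here
}
adjust_parent_thresholds(thresholds, parent_scores)
```

Because the update takes a minimum, a threshold can only decrease as more hypotheses from descendant queues are examined.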

These parent scores are computed by removing from
score_i the contributions for each of the source morphemes
la, jeune, and fille. For example, to adjust the threshold for
the priority queue [2,3], it is necessary to remove the
contribution to score_i associated with the source morpheme
in position 1, which is la. This morpheme is the only
morpheme aligned with the, so the language model contribution
for the must be removed, as well as the translation
model contributions associated with la. Therefore:



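$$\text{score}_p = \frac{\text{score}_i}{l(\text{the} \mid *, \text{boundary})\; t(\text{la} \mid \text{the})\; n(1 \mid \text{the})\; d(1 \mid 1)} \qquad (191)$$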

As another example, to adjust the threshold for the
priority queue [1,3], it is necessary to remove the contribution
to score_i associated with the source morpheme in
position 2, which is jeune. This morpheme is one of two
aligned with the target morpheme girl. If the connection 
between girl and jeune is removed from the partial align- 
ment in FIG. 59, there is still a connection between girl and 
fille. In other words, girl is still needed in the partial 
hypothesis to account for fille. Therefore, no language 
model component is removed. The parent score in this case 
is: 



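$$\text{score}_p = \text{score}_i \times \frac{\sum_{i=2}^{25} n(i \mid \text{girl})}{n(2 \mid \text{girl})} \times \frac{1}{t(\text{jeune} \mid \text{girl})} \times \frac{1}{d(2 \mid 2)} \qquad (192)$$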

Here, the first quotient adjusts the fertility score, the second
adjusts the lexical score and the third adjusts the distortion
score.

With some thought, it will be clear to one skilled in the art
how to generalize from these examples to other situations. In 
general, a parent score is computed by removing a connec- 
tion from the partial alignment associated with the partial 
hypothesis i. Such a connection connects a target morpheme 
t in the partial target structure associated with the partial 
hypothesis i and a source morpheme s in a source structure. 
If this connection is the only connection to the target 
morpheme t, then the language model score for t is divided 
out, otherwise it is left in. The lexical and distortion scores 
associated with the source morpheme s are always divided 
out, as is the fertility score associated with the target 
morpheme t. If n connections remain to the target morpheme 
t, since n+1 source morphemes are aligned with t in the 
partial hypothesis i, then the open fertility score serving as 
an estimate of the probability that at least n+1 source 
morphemes will be aligned with t is multiplied in. 

When the list of hypotheses to be extended that is created 
in step 5402 is empty the search terminates. 

Refer now to step 5404 in FIG. 54. This step extends a list 
of partial hypotheses. An embodiment of the method by 
which this extension takes place is documented in the 
pseudo code in FIG. 68. 

The procedure extend_partial_hypotheses_on_list
takes as input a list of hypotheses to be extended. Lines 1 
through 8 contain a loop through all the partial hypotheses 
on the list which are extended in turn. In line 3 the variable 
h is set to the partial hypothesis being extended on each iteration
of the loop. For each hypothesis, h, that is extended, it can






be extended by aligning an additional source morpheme 
with a morpheme in an hypothesized target structure. In 
lines 4 through 7 a loop is made through every position p in 
the source structure. In certain embodiments, a loop may be 
made only through the first n source positions that are not 
already aligned in the partial hypothesis h. In a preferred 
embodiment n is set to 4. At line 5 a test is made to see if 
the source morpheme at position p is already aligned by the 
partial hypothesis with an hypothesized target morpheme. If 
it is not, then at line 6 a call is made to the procedure
extend_h_by_accounting_for_source_morpheme_in_position_p,
which is contained in lines 10 through 31 of FIG. 68.
At line 11 a check is made to determine if the partial
hypothesis h is open, in other words, if it contains an open
target morpheme. If it is open then extensions are made in
lines 12 through 14. On line 12, the variable q is set to the
position in the hypothesized partial target structure of the
open morpheme. Each of the extensions in lines 12
through 14 is made by adding a connection to the partial
alignment of h. Each such connection is a connection from
the morpheme at position p in the source structure to the
open morpheme at position q in the target structure. On line
13, an extension is created in which a connection from p to
q is added to the partial alignment of h and in which the
morpheme at position q is kept open. On line 14, an extension
is created in which a connection from p to q is added
to the partial alignment of h and in which the morpheme at
position q is closed.
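The outer loop over hypotheses and source positions, together with the two extensions of an open hypothesis, might look roughly like the following Python sketch. It is not the pseudo code of FIG. 68: the Hypothesis class and its fields are hypothetical, scores are omitted, and the extensions of closed hypotheses are left out.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hypothesis:
    target: list                                    # hypothesized target morphemes
    alignment: dict = field(default_factory=dict)   # source position -> target position
    open_position: Optional[int] = None             # position of the open target morpheme


def extend_open(h, p):
    """Connect source position p to the open target morpheme of h."""
    q = h.open_position
    alignment = {**h.alignment, p: q}
    keep_open = Hypothesis(list(h.target), dict(alignment), q)   # morpheme stays open
    closed = Hypothesis(list(h.target), dict(alignment), None)   # morpheme is closed
    return [keep_open, closed]


def extend_partial_hypotheses_on_list(hypotheses, source_length, n=4):
    """Extend each hypothesis at the first n source positions not yet aligned."""
    extensions = []
    for h in hypotheses:
        unaligned = [p for p in range(1, source_length + 1) if p not in h.alignment]
        for p in unaligned[:n]:
            if h.open_position is not None:
                extensions.extend(extend_open(h, p))
            # Extensions of closed hypotheses are not covered by this sketch.
    return extensions
```

Each extension copies the alignment rather than mutating it, so the hypothesis being extended remains available for further extensions at other source positions.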

Extensions of partial hypotheses h which are closed are 
made in lines 17 through 29. First, in line 17 the variable s 
is set to the identity of the source morpheme at position p in 
the source structure. This morpheme will have a number of 
possible target translations. In terms of the translation 
model, this means that there will be a number of target 
morphemes t for which the lexical parameter t(t|s) is greater
than a certain threshold, which in an embodiment is set equal 
to 0.001. The list of such target morphemes for a given 
source morpheme s can be precomputed. In lines 18 through 
29 a loop is made through a list of the target morphemes for 
the source morpheme s. The variable t is set to the target 
morpheme being processed in the loop. On line 20, an 
extension is made in which the target morpheme t is 
appended to the right end of the partial target structure 
associated with h and then aligned with the source mor- 
pheme at position p, and in which the target morpheme t is 
open in the resultant partial hypothesis. On line 21, an 
extension is made in which the target morpheme t is 
appended to the right end of the partial target structure 
associated with h and then aligned with the source mor- 
pheme at position p, and in which the target morpheme t is 
closed in the resultant partial hypothesis. On line 22, an 
extension is made in which the target morpheme t is 
appended to the null target morpheme in the partial target 
structure associated with hypothesis h. It is assumed 
throughout this description of hypothesis search that every 
partial hypothesis comprises a single null target morpheme. 

The remaining types of extensions to be performed are 
those in which the target structure is extended by two 
morphemes. In such extensions, the source morpheme at 
position p is aligned with the second of these two target 
morphemes. On line 23, a procedure is called which creates 
a list of target morphemes that can be inserted between the 
last morpheme on the right of the hypothesis h and the 
hypothesized target morpheme, t. The lists of target mor- 
phemes created by this procedure can be precomputed from 
language model parameters. In particular, suppose t_r is the
last morpheme on the right of the partial target structure
comprised by the partial hypothesis h. For any target morpheme
t_1 the language model provides a score for the
three-word sequence t_r t_1 t. In one preferred embodiment this
score is equal to an estimate of the 1-gram probability for the
morpheme t_r multiplied by an estimate of the
2-gram conditional probability with which t_1 follows t_r,
multiplied by an estimate of the 3-gram conditional probability
with which t follows the pair t_r t_1. By computing such
a score for each target morpheme t_1, the target morphemes
can be ordered according to these scores. The list returned by
the procedure called on line 23 is comprised of the m best
t_1's which have scores greater than a threshold z. In one
embodiment, z is equal to 0.001 and m is equal to 100.
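A list of this kind could be built along the following lines. In this Python sketch, t_r denotes the last morpheme of the partial target structure, t_1 a candidate morpheme to insert, and t the hypothesized target morpheme; the dictionary-based language model tables and all probability values are assumptions made for illustration.

```python
def insertion_candidates(t_r, t, vocabulary, unigram, bigram, trigram,
                         z=0.001, m=100):
    """Return the m best morphemes t_1 to insert between t_r and t.

    The score of t_1 is Pr(t_r) * Pr(t_1 | t_r) * Pr(t | t_r, t_1);
    only candidates whose score exceeds the threshold z are kept.
    """
    scored = []
    for t_1 in vocabulary:
        score = (unigram.get(t_r, 0.0)
                 * bigram.get((t_r, t_1), 0.0)
                 * trigram.get((t_r, t_1, t), 0.0))
        if score > z:
            scored.append((score, t_1))
    scored.sort(reverse=True)          # best scores first
    return [t_1 for _, t_1 in scored[:m]]


# Made-up language model estimates for the example "to_wake up her".
unigram = {"to_wake": 0.1}
bigram = {("to_wake", "up"): 0.5, ("to_wake", "the"): 0.2, ("to_wake", "down"): 0.3}
trigram = {("to_wake", "up", "her"): 0.4,
           ("to_wake", "the", "her"): 0.01,
           ("to_wake", "down", "her"): 0.2}
candidates = insertion_candidates("to_wake", "her", ["up", "the", "down"],
                                  unigram, bigram, trigram)
```

Since the scores depend only on language model parameters, such lists can be computed once per pair of morphemes and reused throughout the search.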

The loop on lines 24 through 28 makes extensions for
each t_1 on the list created on line 23. On lines 26 and 27,
extensions are made in which the pair of target morphemes
t_1, t is appended to the end of the partial target structure
comprised by the partial hypothesis h, and in which the
source morpheme at position p is aligned with t. The
hypotheses which result from extensions made on line 26 are
open and the hypotheses which result from extensions made
on line 27 are closed.

While various embodiments of the present invention have 
been described above, it should be understood that they have 
been presented by way of example, and not limitation. Thus
the breadth and scope of the present invention should not be 
limited by any of the above-described exemplary embodi- 
ments, but should be defined only in accordance with the 
following claims and their equivalents. 

Having thus described our invention, what we claim as
new and desire to secure by Letters Patent is: 

1. A method operating on a computer for translating 
source text from a first language to target text in a second 
language different from the first language, said method 
comprising the steps of:

measuring a value of the source text in the first language
and storing the source text in a first memory buffer; 

generating the target text in the second language based on
a combination of a probability of occurrence of an
intermediate structure of text associated with a target
hypothesis selected from the second language using a
target language model, and a probability of occurrence
of the source text given the occurrence of said intermediate
structure of text associated with said target
hypothesis using a target-to-source translation model;
and

performing at least one of a storing operation to save said
target text in a second memory buffer and a presenting
operation to make said target text available for at least
one of a viewing and listening operation using an I/O
device.

2. A method according to claim 1, further comprising the 
step of: 

receiving at least one user defined criteria pertaining to
the source text to thereby bound the target text, said
criteria belonging to the group comprising specification
of a noun phrase, specification of subject, specification
of verb, a semantic sense of a word or group of words,
inclusion of a predetermined word or group of words,
and exclusion of a predetermined word or group of
words.

3. A method operating on a computer for translating 
source text from a first language to target in a second 
language different from the first language, said method
comprising the steps of: 

receiving the source text in the first language; 



generating at least one target hypothesis, each of said 
target hypotheses comprising text selected from the 
second language; 

estimating, for each target hypothesis, a first probability 
of occurrence of said text associated with said target 
hypothesis using a target language model; 

estimating, for each target hypothesis, a second probabil- 
ity of occurrence of the source text given the occur- 
rence of said text associated with said target hypothesis 
using a target-to-source translation model; 

combining, for each target hypothesis, said first and 
second probabilities to produce a target hypothesis 
match score; and 

performing at least one of a storing operation to save and
a presenting operation to make available for at least one 
of a viewing and listening operation, at least one of said 
target hypotheses according to its associated match 
score. 

4. A method according to claim 3, further comprising the 
steps of: 

generating a list of partial target hypotheses using said 
target language model and said target-to-source trans- 
lation model; 

extending said partial hypotheses to generate a set of new 
hypotheses; 

determining a match score for at least one of said new
hypotheses; and

performing at least one of a storing operation to save and
a presenting operation to make available for viewing, at
least one of said new hypotheses according to its
associated match score.

5. A method according to claim 4, wherein said match 
score determining step further comprises the steps of: 

determining a language model score for each of said 
partial hypotheses using said target language model; 

determining a translation model score for each of said 
partial hypotheses using said target-to-source transla- 
tion model; and 

determining a combined score of said language and 
translation model scores for each of said partial hypoth- 
eses. 

6. A method according to claim 5, wherein said translation 
model score determining step further comprises at least one
of the following steps:

determining a fertility model score for each of said partial
hypotheses, said partial hypotheses comprising at least 
one notion and said fertility model score being propor- 
tional to a probability that a notion in the target text will 
generate a specific number of units of linguistic struc- 
ture in the source text; 

determining an alignment score for each of said partial
hypotheses, said alignment score being proportional to 
a probability that a unit of linguistic structure in the 
target text will align with one of zero or more units of 
linguistic structure in the source text; 

determining a lexical model score, for each of said partial
hypotheses, said lexical model score being proportional 
to the probability that said units of linguistic structure 
in the target text of a given partial hypothesis will 
translate into said units of linguistic structure of said 
source text; 

determining a distortion model score for each of said 
partial hypotheses, said distortion model score being 
proportional to the probability that a source unit of 






linguistic structure will be in a particular position given 
a position of the target units of linguistic structure that 
generated it; and 

determining a combined score, said combined score being 
proportional to a product of at least one of said fertility,
alignment, lexical and distortion model scores for each 
of said partial hypotheses. 

7. A method according to claim 5, wherein said translation 
model score determining step further comprises at least one
of the following steps: 

determining a fertility model score for each of said partial 
hypotheses, said partial hypotheses comprising at least 
one notion and said fertility model score being proportional
to a probability that a notion in the target text will
generate a specific number of notion units of linguistic 
structure in an intermediate structure source text; 

determining an alignment score for each of said partial 
hypotheses, said alignment score being proportional to 
the probability that a unit of linguistic structure in said
intermediate structure of target text will align with one 
of zero or more units of linguistic structure in said 
intermediate structure of source text; 

determining a lexical model score, for each of said partial
hypotheses, said lexical model score being proportional
to a probability that said units of linguistic structure in
said intermediate structure of target text of a given
partial hypothesis will translate into said units of linguistic
structure of said intermediate structure source
text;

determining a distortion model score for each of said 
partial hypotheses, said distortion model score being 
proportional to a probability that a source unit of
linguistic structure will be in a particular position given 
the position of the target units of linguistic structure 
that generated it; and 

determining a combined score, said combined score being 
proportional to a product of at least one of said fertility,
alignment, lexical and distortion model scores for each 
of said partial hypotheses. 

8. A method according to claim 6, wherein each of said 
notions comprises at least one unit of linguistic structure, 
and each of said units of linguistic structure in the target text
produces one of zero or more units of linguistic structure in
the source text.

9. A method according to claim 3, further comprising the 
step of: 

receiving at least one user defined criteria pertaining to 
the source text to thereby bound the target text, criteria 
belonging to the group comprising specification of a 
noun phrase, specification of subject, specification of 
verb, a semantic sense of a word or group of words,
inclusion of a predetermined word or group of words, 
and exclusion of a predetermined word or group of 
words. 

10. A method operating on a computer for translating 
source text from a first language to target text in a second
language different from the first language, said method 
comprising the steps of: 

receiving the source text in the first language and storing
the source text in a first memory buffer;

receiving one of zero or more user defined criteria pertaining
to the source and target texts to thereby bound
the target text;






accessing the source text from said first buffer and trans- 
ducing the source text into at least one intermediate 
source structure of text constrained by any of said user 
defined criteria; 

generating at least one target hypothesis, each of said 
target hypotheses comprising an intermediate target 
structure of text selected from the second language 
constrained by any of said user defined criteria; 

estimating a first score, said first score being proportional 
to a probability of occurrence of each intermediate 
target structure of text associated with said target 
hypotheses using a target structure language model; 

estimating a second score, said second score being pro- 
portional to a probability that said intermediate target 
structure of text associated with said target hypotheses 
will translate into said intermediate source structure of 
text using a target structure-to-source structure trans- 
lation model; 

combining, for each target hypothesis, said first and 
second scores to produce a target hypothesis match 
score; 

transducing each of said intermediate target structures of 
text into at least one transformed target hypothesis of 
text in the second language constrained by any of said 
user defined criteria; and 

performing at least one of a storing operation to save said 
at least one transformed target hypothesis in a second 
memory buffer and a presenting operation to make 
available for at least one of a viewing and listening 
operation according to its associated match score and 
said user defined criteria. 

11. A method according to claim 10, wherein said inter- 
mediate source and target structures are transduced by: 

arranging and rearranging, respectively, elements of the 
source and target text according to at least one of 
lexical substitutions, part-of-speech assignment, a mor- 
phological analysis, and a syntactic analysis. 

12. A method according to claim 11, wherein said syn- 
tactic analysis further comprises the steps of: 

performing question inversion; 
performing do-not coalescence; and 
performing adverb movement.

13. A method according to claim 11, wherein said inter- 
mediate source structure transducing step further comprises 
the step of: 

annotating said words with parts of speech by probabilistically
assigning a parts of speech to said words according
to a statistical model.

14. A method according to claim 13, wherein said inter- 
mediate source structure transducing step further comprises 
the step of: 

analyzing, syntactically, said annotated words and out- 
putting a sequence of words and a sequence of said 
parts of speech corresponding to said sequence of 
words. 

15. A method according to claim 14, wherein said inter- 
mediate source structure transducing step further comprises 
the step of: 

analyzing, morphologically, said syntactically analyzed 
words by performing a morphological analysis of said 
words in accordance with said assigned parts of speech. 

16. A method according to claim 15, wherein said inter- 
mediate source structure transducing step further comprises 
the step of: 

assigning a sense to at least one of said morphological 
units to elucidate the translation of that morphological 
unit in the target language. 






17. A method according to claim 11, wherein said syn- 
tactic analysis further comprises the steps of: 

performing question inversion; 
performing negative coalescence; and 
performing at least one of adverb, pronoun and adjective 
movement. 

18. A method according to claim 10, wherein said inter- 
mediate source structure transducing step further comprises 
the step of: 

tokenizing the source text by identifying individual words 
and word separators and arranging said words and said 
word separators in a sequence. 

19. A method according to claim 18, wherein said inter- 
mediate source structure transducing step further comprises 
the step of: 

determining case of said words using a case transducer,
said case transducer assigning each word a token and a 
case pattern, said case pattern specifying a case of each 
letter of the word, and evaluating each case pattern to 
determine a true case pattern of the word. 

20. A method according to claim 19, wherein said deter- 
mining step further comprises the steps of: 

(1) determining whether said token is part of a name; 

(2) if said token is not part of a name: 

determining whether said token is a member of a set of
tokens which have only one true case; if said token 
is part of a name: 

setting the true case pattern according to a table of true 
cases for names; 

(3) if the true case has not been determined by steps 1 and 30 
2, and if the token is a first token in a sentence, then: 
setting the true case pattern equal to a most probable 

true case for that token; and 

(4) otherwise: 

setting the true case pattern equal to the only one case 35 
for that token. 
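By way of illustration only, one possible reading of the case determination recited in claims 19 and 20 is sketched below. The pattern alphabet (`U` for upper case, `L` for lower case), the three lookup tables, and the fallback to the observed pattern in the final step are all assumptions of this sketch and are not taken from the reference.

```python
def case_pattern(word: str) -> str:
    """Case pattern of a word: one symbol per letter (U = upper, L = lower)."""
    return "".join("U" if ch.isupper() else "L" for ch in word)

def true_case_pattern(word, *, in_name, first_in_sentence,
                      name_cases, single_case, most_probable):
    """Pick a true case pattern for one word.

    Hypothetical tables, all keyed by the lower-cased token:
      name_cases    - true case patterns for tokens that are parts of names
      single_case   - tokens that have only one true case
      most_probable - most probable true case for each token
    """
    token = word.lower()
    observed = case_pattern(word)
    if in_name and token in name_cases:          # steps 1-2: name table
        return name_cases[token]
    if not in_name and token in single_case:     # step 2: single-case tokens
        return single_case[token]
    if first_in_sentence:                        # step 3: sentence-initial
        return most_probable.get(token, observed)
    return observed                              # step 4 (assumed fallback)

tables = dict(name_cases={"ibm": "UUU"},
              single_case={"of": "LL"},
              most_probable={"the": "LLL"})
first = true_case_pattern("The", in_name=False, first_in_sentence=True, **tables)
name = true_case_pattern("ibm", in_name=True, first_in_sentence=False, **tables)
```

Under these assumed tables, a sentence-initial "The" receives the pattern `LLL`, since its capital is attributed to its position, while "ibm" inside a name receives `UUU`.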

21. A method according to claim 10, wherein each of said 
intermediate target structures is expressed as an ordered 
sequence of units of linguistic structure, and a first score 
probability is obtained by multiplying conditional probabili- 
ties of each of said units of linguistic structure within an 
intermediate target structure given an occurrence of previous 
units of linguistic structure within said intermediate target 
structure. 

22. A method according to claim 21, wherein said con- 
ditional probability of each unit of each of linguistic struc- 
ture within an intermediate target structures depends only on 
a fixed number of preceding units within said intermediate 
target structure. 
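By way of illustration only, the scoring recited in claims 21 and 22 corresponds to what is commonly called an n-gram model: the score of a sequence is the product of the conditional probability of each unit given a fixed number of preceding units. The sketch below works in log space to avoid underflow. The probability table layout, the `<s>` padding symbol, and the `floor` value for unseen events are assumptions of this sketch.

```python
import math

def sequence_log_prob(units, cond_prob, order=3, floor=1e-9):
    """Log of the product of P(unit | previous order-1 units).

    cond_prob maps (history_tuple, unit) to a probability. Events missing
    from the table get `floor` (an assumed stand-in for real smoothing).
    """
    history_len = order - 1
    padded = ["<s>"] * history_len + list(units)
    total = 0.0
    for i in range(history_len, len(padded)):
        history = tuple(padded[i - history_len:i])
        total += math.log(cond_prob.get((history, padded[i]), floor))
    return total

# Toy bigram table (order=2): each unit depends on one preceding unit.
bigrams = {(("<s>",), "the"): 0.5, (("the",), "cat"): 0.2}
log_p = sequence_log_prob(["the", "cat"], bigrams, order=2)
```

Here `math.exp(log_p)` recovers the product 0.5 × 0.2 = 0.1.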

23. A computer system for translating source text from a 
first language to target text in a second language different 
from the first language, said system comprising: 

means for receiving the source text in the first language; 

means for generating at least one target hypothesis, each 
of said target hypotheses comprising text selected from 
the second language; 

means for estimating, for each target hypothesis, a first 
probability of occurrence of said text associated with 
said target hypothesis using a target language model; 

means for estimating, for each target hypothesis, a second 
probability of occurrence of the source text given the 
occurrence of said text associated with said target 
hypothesis using a target-to-source translation model; 

means for combining, for each target hypothesis, said first 
and second probabilities to produce a target hypothesis 
match score; and 






means for performing at least one of a storing and 
presenting operation to save or otherwise make avail- 
able for at least one of a viewing and listening opera- 
tion, at least one of said target hypotheses according to 
its associated match score. 
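By way of illustration only, the combining step of claim 23 might be sketched as below, taking the match score to be the product of the first probability (target language model) and the second probability (target-to-source translation model). The claim recites only "combining", so the product is an assumption of this sketch, as are the function names and the toy probability tables.

```python
def rank_hypotheses(source, hypotheses, lm_prob, tm_prob):
    """Score each target hypothesis and return (score, hypothesis) pairs,
    best first.

    lm_prob(target)         -> probability of the target text
    tm_prob(source, target) -> probability of the source given the target
    """
    scored = [(lm_prob(t) * tm_prob(source, t), t) for t in hypotheses]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# Hypothetical toy models for a single French source phrase.
lm_table = {"the house": 0.02, "house the": 0.001}
tm_table = {("la maison", "the house"): 0.3, ("la maison", "house the"): 0.3}

ranked = rank_hypotheses(
    "la maison",
    ["house the", "the house"],
    lm_prob=lambda t: lm_table[t],
    tm_prob=lambda s, t: tm_table[(s, t)],
)
best_score, best_text = ranked[0]
```

In this toy case the translation model cannot distinguish the two word orders, so the language model decides the ranking.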

24. A system according to claim 23, further comprising: 
means for generating a list of partial target hypotheses 
using said target language model and said target-to- 
source translation model; 

means for extending said partial hypotheses to generate a 
set of new hypotheses; 

means for determining a match score for at least one of 
said new hypotheses; and 

means for performing at least one of a storing and 
presenting operation to save or otherwise make avail- 
able for at least one of a viewing and listening opera- 
tion, at least one of said new hypotheses according to 
its associated match score. 

25. A system according to claim 24, wherein said means 
for determining said match score further comprises: 

means for determining a language model score for each of 
said partial hypotheses using said target language 
model; 

means for determining a translation model score for each 
of said partial hypotheses using said target-to-source 
translation model; 

and means for combining said language and said trans- 
lation model scores for each of said partial hypotheses 
into a combined score. 

26. A system according to claim 25, wherein said means 
for determining said translation model score further com- 
prises: 

means for determining a fertility model score for each of 
said partial hypotheses, said partial hypotheses com- 
prising at least one notion and said fertility model score 
being proportional to the probability that a notion in the 
target text will generate a specific number of notion 
units of linguistic structure in an intermediate structure 
source text; 

means for determining an alignment score for each of said 
partial hypotheses, said alignment score being propor- 
tional to a probability that a unit of linguistic structure 
in said intermediate structure of target text will align 
with one of zero or more units of linguistic structure in 
said intermediate structure of source text; 

means for determining a lexical model score, for each of 
said partial hypotheses, said lexical model score being 
proportional to the probability that said units of lin- 
guistic structure in said intermediate structure of target 
text of a given partial hypothesis will translate into said 
units of linguistic structure of said intermediate struc- 
ture source text; 

means for determining a distortion model score for each 
of said partial hypotheses, said distortion model score 
being proportional to a probability that a source unit of 
linguistic structure will be in a particular position given 
the position of the target units of linguistic structure 
that generated it; and 

means for determining a combined score, said combined 
score being proportional to the product of at least one 
of said fertility, alignment, lexical and distortion model 
scores for each of said partial hypotheses. 
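By way of illustration only, the combined score of claim 26 might be computed as the product of whichever component model scores are available for a partial hypothesis. The dictionary interface and the validation of component names are assumptions of this sketch.

```python
import math

COMPONENTS = {"fertility", "alignment", "lexical", "distortion"}

def combined_score(component_scores: dict[str, float]) -> float:
    """Product of at least one of the fertility, alignment, lexical and
    distortion model scores for one partial hypothesis."""
    unknown = set(component_scores) - COMPONENTS
    if unknown or not component_scores:
        raise ValueError("need at least one known component score")
    return math.prod(component_scores.values())

score = combined_score(
    {"fertility": 0.5, "alignment": 0.4, "lexical": 0.2, "distortion": 0.1}
)
```

In practice such products are usually accumulated as sums of logarithms, since multiplying many small probabilities underflows quickly.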

27. A system according to claim 25, wherein said means 
for determining said translation model score further com- 
prises: 






means for determining a fertility model score for each of 
said partial hypotheses, said partial hypotheses com- 
prising at least one notion and said fertility model score 
being proportional to the probability that a notion in the 
target text will generate a specific number of notions of 
units of linguistic structure in the source text; 

means for determining an alignment score for each of said 
partial hypotheses, said alignment score being propor- 
tional to the probability that a unit of linguistic struc- 
ture in the target text will align with one of zero or more 
units of linguistic structure in the source text; 

means for determining a lexical model score, for each of 
said partial hypotheses, said lexical model score being 
proportional to the probability that said units of lin- 
guistic structure in the target text of a given partial 
hypothesis will translate into said units of linguistic 
structure of source text; 

means for determining a distortion model score for each 
of said partial hypothesis, said distortion model score 
being proportional to the probability that the source 
units of linguistic structure will be in a particular 
position given the position of the target units of lin- 
guistic structure that generated it; and 

means for determining a combined score, said combined 
score being proportional to the product of at least one 
of said fertility, alignment, lexical and distortion model 
scores for each of said partial hypotheses. 

28. A system according to claim 27, wherein each of said 
notions comprises at least one unit of linguistic structure, 
and each of said units of linguistic structure in the target text 
produces one of zero or more units of linguistic structure in 
the source text. 

29. A system according to claim 23, further comprising: 

means for receiving at least one user defined criteria 
pertaining to the source text to thereby bound the target 
text, said criteria belonging to a group comprising 
specification of a noun phrase, specification of subject, 
specification of verb, a semantic sense of a word or 
group of words, inclusion of a predetermined word or 
group of words and exclusion of a predetermined word 
or group of words. 

30. A computer system for translating source text from a 
first language to target text in a second language different 
from the first language, said system comprising: 

means for receiving the source text in the first language 
and storing the source text in a first memory buffer; 

means for receiving one of zero or more user defined 
criteria pertaining to the source and target texts to 
thereby bound the target text; 

means for accessing the source text from said first buffer; 

first transducing means for transducing the source text 
into at least one intermediate source structure of text 
constrained by any of said user defined criteria; 

means for generating at least one target hypothesis, each 
of said target hypotheses comprising a intermediate 
target structure of text selected from the second lan- 
guage constrained by any of said user defined criteria; 

means for estimating a first score, said first score being 
proportional to a probability of occurrence of each 
intermediate target structure of text associated with said 
target hypotheses using a target structure language 
model; 

means for estimating a second score, said second score 
being proportional to a probability that said interme- 
diate target structure of text associated with said target 
hypotheses will translate into said intermediate source 
structure of text using a target structure-to-source struc- 
ture translation model; 

means for combining, for each target hypothesis, said first 
and second scores to produce a target hypothesis match 
score; 

second transducing means for transducing each of said 
intermediate target structures of text into at least one 
transformed target hypothesis of text in the second 
language constrained by any of said user defined cri- 
teria; and 

means for at least one of storing said at least one trans- 
formed target hypothesis in a second memory buffer, 
and presenting or otherwise making said at least one 
transformed target hypothesis available for at least one 
of a viewing and listening operation according to its 
associated match score and said user defined criteria. 

31. A system according to claim 30, wherein said first and 
second transducing means further comprise: 

third means for arranging elements of the source text 
according to at least one of lexical substitutions, part- 
of-speech assignment, a morphological analysis, and a 
syntactic analysis; and fourth means for rearranging 
elements of the target text according to at least one of 
lexical substitutions, part-of-speech assignment, a mor- 
phological analysis, and a syntactic analysis. 

32. A system according to claim 31, further comprising: 

means for performing question inversion; 
means for performing do-not coalescence; and 
means for performing adverb movement. 

33. A system according to claim 31, further comprising: 

means for annotating said words with parts of speech by 
probablistically assigning a parts of speech of said 
words according to a statistical model. 

34. A system according to claim 33, further comprising: 

means for analyzing, syntactically, said annotated words 
and outputting a sequence of words and a sequence of 
said part-of-speech corresponding to said sequence of 
words. 

35. A system according to claim 34, further comprises: 
means for analyzing, morphologically, said syntactically 
analyzed words by performing a morphological analysis of 
said words in accordance with said assigned parts of speech. 

36. A system according to claim 35, further comprising: 

means for assigning a sense to at least one of said 
morphological units to elucidate the translation of that 
morphological unit in the target language. 

37. A system according to claim 31, further comprising: 

means for performing question inversion; 
means for performing negative coalescence; and 
means for performing at least one of adverb, pronoun and 
adjective movement. 

38. A system according to claim 30, wherein said first 
transducing means further comprises: 

means for tokenizing the source text by identifying indi- 
vidual units of linguistic structure and unit separators 
and arranging said units of linguistic structure and said 
unit separators in a sequence. 

39. A system according to claim 38, further comprising: 

a case transducer means for assigning to each word a 
token and a case pattern, said case pattern specifying a 
case of each letter of the word, and evaluating each case 
pattern to determine a true case pattern of the word. 






40. A system according to claim 39, further comprising: 

(a) means for determining whether said token is part of a 
name; 

(b) means for setting a true case according to a table of 
true cases for names if said token is part of a name; 

(c) means for determining whether said token is a member 
of a set of tokens which have only one true case if said 
token is not part of a name; 

(c) if the true case has not be determined by steps a and 
b, and if said token is part of a name, then: 

(d) means for setting the true case equal to a most 
probable true case for that token if said token is part of 
a name; and 

(f) means for setting the true case equal to the case for that 
token if said token is not part of a name. 






41. A system according to claim 30, further comprising: 

means for expressing each of said intermediate target 
structures as an ordered sequence of units of linguistic 
structure, and means for multiplying conditional prob- 
abilities of said units within an intermediate target 
structure given an occurrence of previous units of 
linguistic structure within said intermediate target 
structure to obtain said first score. 

42. A system according to claim 41, wherein a conditional 
probability of each unit of each of said intermediate target 
structure depends only on a fixed number of preceding units 
within said intermediate target structure. 



UNITED STATES PATENT AND TRADEMARK OFFICE 

CERTIFICATE OF CORRECTION 



PATENT NO. : 5,477,451 

DATED : December 19, 1995 

INVENTOR(S) : Peter F. Brown, et al. 

It is certified that error appears in the above-identified patent and that said Letters Patent is hereby 
corrected as shown below: 

Column 99: 

Claim 1, line 5, change "firs" to --first--; 

Column 100: 

Claim 6, line 10, change "a" to --an-- 

line 28 (column 101, line 5), change "leas tone" to --least one--; and 



Signed and Sealed this 
Eighteenth Day of June, 1996 



Attest: 




Attesting Officer 



Commissioner of Patents and Trademarks 



